Tuesday, March 2, 2021

MIT Moments in Time Dataset

I had a look at the Moments in Time dataset from MIT CSAIL, which Jim asked me to check out. The idea is to use short video clips to recognize and understand actions and events in videos. They give an example - "opening" of drawers, boxes, curtains, eyes, etc.


Now, these are multiple representations of the same word - in particular, an abstract word like "opening". So can this system be said to understand the meaning of the word "opening"? In that case, why not use this dataset to help computers understand language - the meanings of words? Show them 4-5 instances of "opening" (the examples they give - curtains, eyes, flower petals, etc.), or of "consumption" (say, mangoes by a person, fuel by an engine, sugar in stores), or of "absorption" (say, water by a sponge, words by a mind, rays by a shield), and the common aspect of the set is the meaning of the word!

Consider 'consumption', for example. After seeing the three videos described in parentheses above - mangoes, fuel and sugar - the machine should grasp that the "going in of the mangoes", the "drawing forth and burning of the fuel by the engine" and the "emptying of the sugar from the reservoir into use by the consumers" all effectively amount to "something being displaced, drawn forth or taken, for usage in some form" - and that is 'consumption'. This gives the machine a real "understanding" of the word. Minsky says that you don't understand anything until you understand it in at least three ways. This is an exact instance of that!
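To make this concrete, here is a minimal sketch of one way the "common aspect" could be extracted. It is purely illustrative and not anything from the dataset's authors: the clip embeddings are simulated with toy vectors (a real system would get them from a video encoder trained on such a dataset), and a word's meaning is approximated as the centroid of its instances' embeddings - whatever the instances share survives the averaging, and the instance-specific details wash out.

```python
import numpy as np

# Hypothetical clip embeddings: in a real system these would come from a
# pretrained video encoder; here they are toy vectors for illustration.
rng = np.random.default_rng(0)
dim = 64

def toy_clip(concept_direction, noise=0.3):
    """Simulate one instance of a concept: shared direction + instance-specific noise."""
    return concept_direction + noise * rng.standard_normal(concept_direction.shape)

consumption_axis = rng.standard_normal(dim)  # the hidden "common aspect"
absorption_axis = rng.standard_normal(dim)

# Three demonstrations per word, as in the post's examples.
consumption_clips = [toy_clip(consumption_axis) for _ in range(3)]  # mangoes, fuel, sugar
absorption_clips = [toy_clip(absorption_axis) for _ in range(3)]    # sponge, mind, shield

# The word's "meaning" is approximated by what its instances share:
# here, simply the centroid of their embeddings.
concepts = {
    "consumption": np.mean(consumption_clips, axis=0),
    "absorption": np.mean(absorption_clips, axis=0),
}

def recognize(clip_embedding):
    """Assign a new clip to the nearest learned concept (cosine similarity)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(concepts, key=lambda w: cos(concepts[w], clip_embedding))

new_clip = toy_clip(consumption_axis)  # an unseen instance, e.g. a furnace burning coal
print(recognize(new_clip))            # expected: "consumption"
```

Note how this lines up with the Minsky point: with one clip the instance-specific noise dominates; with three or more, the average is dominated by the shared direction.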

In general (and not just for actions and events), abstract words can be taught to the machine this way. Abstract words are understood through demonstrations on physical/tangible things; abstract word-meanings are mappings from the word onto events in the world. For example, to teach someone what 'music' means, you expose them to music and say "what you are hearing now is music", i.e. "you hearing = music". The machine understands 'hearing' from this dataset, and it knows concrete/physical/tangible words like 'you' (a prerequisite for using the dataset this way). So representations of the creation of, exposure to, or consumption of music, supplemented with descriptive instructions, would teach the machine the meaning of the abstract word 'music'.
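Here is an equally toy sketch of that idea: if we assume, purely for illustration, that a scene embedding is roughly additive in its parts, and every word in the teacher's instruction except 'music' is already grounded, then the new word's meaning is whatever residual is left after the known parts are removed. The vectors, the additivity assumption and all the names below are mine, not the dataset's.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Assume the machine already has grounded vectors for concrete words ('you')
# and for dataset actions ('hearing'), as the post stipulates.
known = {"you": rng.standard_normal(dim), "hearing": rng.standard_normal(dim)}
true_music = rng.standard_normal(dim)  # hidden meaning the teacher demonstrates

def episode(parts, noise=0.2):
    """Toy sensory embedding of a scene, assumed (roughly) additive in its parts."""
    return sum(parts) + noise * rng.standard_normal(dim)

# The teacher plays music and says "you hearing music" over several demonstrations.
demos = [episode([known["you"], known["hearing"], true_music]) for _ in range(5)]

# Everything in the instruction except 'music' is already understood, so the
# new word's meaning is estimated as the residual left after removing them.
residuals = [d - known["you"] - known["hearing"] for d in demos]
music_vec = np.mean(residuals, axis=0)

cos = music_vec @ true_music / (np.linalg.norm(music_vec) * np.linalg.norm(true_music))
print(f"similarity to the demonstrated meaning: {cos:.2f}")  # close to 1.0
```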

After the physical words (like 'computer') and abstract words ('opening', 'hearing', 'music', etc.) are taken care of this way, what remains is the connector words of language - in, and, to, upon, at, if, for, etc. These can be taught with multiple exemplary sentences (or fragments, as in the dataset) and the corresponding videos. For example, how do you teach someone the meaning of 'if'? You show the machine videos of someone hitting someone and the latter crying, someone breaking a glass by hitting it, and someone touching a hot stove and their hand burning, and you supplement these clips with sentence fragments like "IF hit person, person cry", "IF hit glass, glass break" and "IF touch hand, hand burn". The machine knows the meanings of all these words (tangible and abstract). So what remains is 'IF', mapped onto the aspect of conditionality/causality in each. Having been shown multiple instances, it can be said to get a "sense" of 'IF'.
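A connector like 'if' is a relation between events rather than a thing, so one toy way to model it is as a map from a cause-embedding to an effect-embedding, fit across the demonstrations. The sketch below fabricates the event embeddings and uses far more than three demonstrations, since learning a relation in a high-dimensional space needs more evidence than learning an object word; everything here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8

# Toy event embeddings. Each demonstration is an (antecedent, consequent) pair:
# (hit person, person cries), (hit glass, glass breaks), (touch stove, hand burns)...
# 'IF' is modeled not as an object but as a relation: a map from cause to effect.
true_relation = rng.standard_normal((dim, dim))  # hidden causal structure of the world

n_demos = 40  # a relation needs more demonstrations than an object word
antecedents = rng.standard_normal((n_demos, dim))
consequents = antecedents @ true_relation.T + 0.1 * rng.standard_normal((n_demos, dim))

# Fit the 'IF' relation by least squares across all demonstrations,
# so that consequent ≈ antecedent @ learned_relation.
learned_relation, *_ = np.linalg.lstsq(antecedents, consequents, rcond=None)

# Test: does the machine anticipate the consequent of an unseen antecedent?
new_cause = rng.standard_normal(dim)          # e.g. "drop the vase"
predicted_effect = new_cause @ learned_relation
actual_effect = true_relation @ new_cause
err = np.linalg.norm(predicted_effect - actual_effect) / np.linalg.norm(actual_effect)
print(f"relative prediction error: {err:.2f}")  # small => a working "sense" of IF
```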
