AI system learns shared concepts between video, audio and text | MIT News

Humans observe the world through a combination of different modalities, such as vision, hearing, and our understanding of language. Machines, on the other hand, interpret the world through data that algorithms can process.

So when a machine “sees” a photo, it must encode that photo into data that it can use to perform a task such as image classification. This process becomes more complicated when the inputs come in multiple formats, like videos, audio clips, and images.

“The main challenge here is how can a machine align these different modalities? As humans, it’s easy for us. We see a car, then we hear the sound of a passing car, and we know it’s the same thing. But for machine learning, it’s not that simple,” says Alexander Liu, a graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper addressing this issue.

Liu and his collaborators have developed an artificial intelligence technique that learns to represent data in a way that captures shared concepts between visual and audio modalities. For example, their method can learn that the action of a baby crying in a video is related to the spoken word “crying” in an audio clip.

Using this knowledge, their machine learning model can identify where a certain action is taking place in a video and label it.

It performs better than other machine learning methods in cross-modal retrieval tasks, which involve finding a piece of data, like a video, that matches a user’s query given in another form, like spoken language. Their model also makes it easier for users to see why the machine thinks the video it retrieved matches their query.

This technique could one day be used to help robots learn concepts of the world through perception, much like humans do.

Joining Liu on the paper are CSAIL postdoc SouYoung Jin; graduate students Cheng-I Jeff Lai and Andrew Rouditchenko; Aude Oliva, senior research scientist at CSAIL and MIT director of the MIT-IBM Watson AI Lab; and senior author James Glass, senior research scientist and head of the Spoken Language Systems Group at CSAIL. The research will be presented at the annual meeting of the Association for Computational Linguistics.

Training representations

The researchers focus their work on representation learning, which is a form of machine learning that seeks to transform input data to help perform a task like classification or prediction.

The representation learning model takes raw data, such as videos and their corresponding text captions, and encodes them by extracting features, or observations about objects and actions in the video. Then it maps those data points onto a grid, known as an embedding space. The model clusters similar data together as single points in the grid. Each of these data points, or vectors, is represented by an individual word.

For example, a video clip of a person juggling might be mapped to a vector labeled “juggling.”

The researchers limit the model so that it can use only 1,000 words to label the vectors. The model can decide which actions or concepts it wants to encode into a single vector, but it can use only 1,000 vectors. It chooses the words it thinks best represent the data.
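The paper’s code isn’t reproduced here, but the general idea of snapping a continuous feature vector onto the nearest of 1,000 word-labeled vectors can be sketched roughly as follows. The dimensions, names, and random stand-in features are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: a codebook of 1,000 vectors, each of which would be tied
# to a word. The feature dimension and random inputs are assumptions, not the paper's code.
vocab_size, dim = 1000, 512
codebook = torch.randn(vocab_size, dim)          # one vector per word

def quantize(features: torch.Tensor) -> torch.Tensor:
    """Snap each continuous feature vector to its nearest codebook entry."""
    features = F.normalize(features, dim=-1)
    codes = F.normalize(codebook, dim=-1)
    similarity = features @ codes.T              # cosine similarity to every word vector
    return similarity.argmax(dim=-1)             # index of the best-matching word

video_features = torch.randn(4, dim)             # stand-in for encoded video clips
word_ids = quantize(video_features)              # e.g., the index tied to "juggling"
```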

Rather than encoding data from different modalities in separate grids, their method uses a joint embedding space where two modalities can be encoded together. This allows the model to learn the relationship between representations of two modalities, such as a video that shows a person juggling and an audio recording of someone saying “juggling.”

To help the system process data from multiple modalities, they designed an algorithm that guides the machine to encode similar concepts into the same vector.

“If there is a pig video, the model can assign the word ‘pig’ to one of 1,000 vectors. Then, if the model hears someone say the word ‘pig’ in an audio clip, it should always use the same vector to encode it,” Liu explains.
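What a joint embedding space with a shared, word-labeled codebook might look like is sketched below. The two encoders, their input sizes, and the simple training signal that nudges paired inputs toward the same word vector are generic assumptions for illustration, not the authors’ architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic sketch of a joint embedding space (not the authors' model): two modality-specific
# encoders project into one shared space that uses one shared, word-labeled codebook.
dim, vocab_size = 512, 1000
video_encoder = nn.Linear(2048, dim)        # assumed video feature size
audio_encoder = nn.Linear(768, dim)         # assumed audio feature size
codebook = nn.Parameter(torch.randn(vocab_size, dim))

def word_scores(embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each embedding to every codebook word."""
    return F.normalize(embeddings, dim=-1) @ F.normalize(codebook, dim=-1).T

video_emb = video_encoder(torch.randn(8, 2048))  # a batch of, say, pig videos
audio_emb = audio_encoder(torch.randn(8, 768))   # matching clips of someone saying "pig"

# One simple way to push paired inputs toward the same word vector: treat the video's
# best-matching word as the classification target for the paired audio.
video_words = word_scores(video_emb).argmax(dim=-1)
loss = F.cross_entropy(word_scores(audio_emb), video_words)
```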

A better retriever

They tested the model on cross-modal retrieval tasks using three datasets: a video-text dataset with video clips and text captions, a video-audio dataset with video clips and spoken audio captions, and an image-audio dataset with images and spoken audio captions.

For example, in the video-audio dataset, the model chose 1,000 words to represent the actions in the videos. Then, when the researchers fed it audio queries, the model tried to find the clip that best matched those spoken words.
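In code, that kind of cross-modal lookup can be sketched as a simple similarity ranking: embed the spoken query, embed every clip in the library, and return the closest matches. The sizes and placeholder embeddings below are assumptions, not the evaluation pipeline from the paper.

```python
import torch
import torch.nn.functional as F

# Retrieval sketch: rank a library of clip embeddings by cosine similarity to an
# audio query embedding and return the indices of the best matches.
def retrieve(query_emb: torch.Tensor, clip_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    query = F.normalize(query_emb, dim=-1)
    clips = F.normalize(clip_embs, dim=-1)
    scores = clips @ query                      # similarity of every clip to the query
    return scores.topk(k).indices               # k best-matching clips

clip_embs = torch.randn(10_000, 512)            # placeholder embeddings for a video library
query_emb = torch.randn(512)                    # placeholder embedding of a spoken query
top_clips = retrieve(query_emb, clip_embs)
```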

“Just like a Google search, you type in some text and the machine tries to tell you the most relevant things you’re looking for. Only, we do this in vector space,” Liu says.

Not only was their technique more likely to find better matches than the models they compared it against, but it is also easier to understand.

Since the model can use only 1,000 total words to label the vectors, a user can more easily see which words the machine used to conclude that the video and the spoken words are similar. This could make the model easier to apply in real-world situations where it’s critical for users to understand how it makes decisions, Liu says.
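As a toy illustration of why the fixed vocabulary helps, the “explanation” for a match reduces to the word each input was assigned, which a user can simply read off. The tiny vocabulary and random embeddings below are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

# Interpretability sketch: with a word-labeled codebook, each input's explanation is the
# word it was assigned. The three-word vocabulary and random embeddings are stand-ins.
words = ["juggling", "pig", "crying"]
codebook = F.normalize(torch.randn(len(words), 512), dim=-1)

def assigned_word(embedding: torch.Tensor) -> str:
    return words[(F.normalize(embedding, dim=-1) @ codebook.T).argmax().item()]

video_emb, audio_emb = torch.randn(512), torch.randn(512)  # placeholders for a paired clip
print(assigned_word(video_emb), assigned_word(audio_emb))  # a trained model would agree here
```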

The model still has some limitations that they hope to address in future work. For one, their research focused on data from two modalities at a time, but in the real world humans encounter many data modalities simultaneously, Liu says.

“And we know that 1,000 words works on this kind of data set, but we don’t know if that can be generalized to a real-world problem,” he adds.

Additionally, the images and videos in their datasets contained simple objects or direct actions; real-world data is much messier. They also want to determine how well their method scales when there is a greater diversity of inputs.

This research was supported, in part, by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, and by the MIT Lincoln Laboratory.
