MIT researchers developed a machine-learning technique that learns to represent data in a way that captures concepts shared between visual and audio modalities. Their model can identify where a certain action is taking place in a video and label it.
— Read on news.mit.edu/2022/ai-video-audio-text-connections-0504
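The article itself contains no code, so purely as an illustrative sketch of what "a shared representation across visual and audio modalities" can mean, here is a toy CLIP-style contrastive alignment of video and audio embeddings. All names, dimensions, and the choice of objective are hypothetical assumptions for illustration, not the MIT researchers' actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Hypothetical two-tower model: projects video and audio features
    into a single shared embedding space."""
    def __init__(self, video_dim=512, audio_dim=128, shared_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, video_feats, audio_feats):
        # L2-normalize so dot products become cosine similarities.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        return v, a

def contrastive_loss(v, a, temperature=0.07):
    """InfoNCE objective: the i-th video should be more similar to its
    own audio track than to any other audio in the batch, and vice versa."""
    logits = v @ a.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))      # i-th video pairs with i-th audio
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-in features; real inputs would come from
# pretrained video and audio encoders.
model = CrossModalAligner()
video = torch.randn(8, 512)   # one clip-level feature per video
audio = torch.randn(8, 128)   # one feature per matching audio track
v, a = model(video, audio)
loss = contrastive_loss(v, a)
loss.backward()
```

After training on paired clips, nearest-neighbor search in the shared space is one common way such a model can match a sound to the region or moment in a video where the action occurs.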