Cross Modal Video Representations for Weakly Supervised Active Speaker Localization

An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is speaking, as well as when, how, and where, and when they are not. Speaker activity can be inferred automatically from the rich multimodal information present in media content. This is, however, a challenging problem because of the wide variety and contextual variability of media content and the scarcity of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information about the spatial location of a speaker in the visual frames. To avoid manual annotation of active speakers in visual frames, which is very expensive to acquire, we present a weakly supervised system for localizing active speakers in movie content. We use the learned cross-modal visual representations together with weak supervision from movie subtitles, which act as a proxy for voice activity and thus require no manual annotations. We evaluate the proposed system on the AVA active speaker dataset and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison to fully supervised systems. We also demonstrate state-of-the-art performance on voice activity detection in an audio-visual framework, especially when speech is accompanied by noise or music.
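The weak-supervision idea sketched in the abstract, where a clip-level speech/no-speech label derived from subtitle timing supervises per-region scores so that spatial localization emerges without box annotations, can be illustrated with a minimal multiple-instance-learning-style pooling sketch. This is an illustrative toy, not the paper's actual architecture; the function names and the choice of max pooling are assumptions for exposition.

```python
import numpy as np

def clip_level_score(region_scores, pooling="max"):
    """Aggregate per-frame, per-region speaker scores into one clip-level
    voice-activity score (MIL-style pooling). A weak clip label, e.g.
    "speech present" inferred from subtitle timing, supervises only this
    pooled score; localization emerges from which regions drive it.

    region_scores: array of shape (T, R) -- T frames, R candidate regions.
    """
    # Most speaker-like region in each frame, then pool over time.
    frame_scores = region_scores.max(axis=1)
    if pooling == "max":
        return float(frame_scores.max())
    if pooling == "mean":
        return float(frame_scores.mean())
    raise ValueError(f"unknown pooling: {pooling}")

def localize(region_scores):
    """Return the (frame, region) index carrying the strongest
    active-speaker evidence, i.e. the weakly supervised localization."""
    t, r = np.unravel_index(np.argmax(region_scores), region_scores.shape)
    return int(t), int(r)
```

At training time, only `clip_level_score` would receive a loss against the subtitle-derived label; `localize` reads out the spatial prediction at inference, mirroring how weakly supervised detectors recover locations from instance scores.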
