Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e., visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g., removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our project webpage.
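The pretext task above -- predicting whether a clip's frames and audio are temporally aligned -- reduces to binary classification over (video, audio) pairs, with negatives made by shifting the audio in time. The sketch below illustrates that training objective in PyTorch. It is a hedged toy, not the paper's model: the paper learns an early-fused multisensory 3D CNN, whereas this example stands in a small two-branch encoder with late concatenation, and every module name, tensor shape, and hyperparameter here is an assumption made for the sake of a self-contained example.

```python
# Minimal sketch of the audio-visual alignment pretext task.
# All shapes and architecture choices are illustrative assumptions,
# not the paper's actual fused network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentNet(nn.Module):
    """Toy audio-visual network: video and audio are encoded separately,
    concatenated, and classified as aligned vs. time-shifted."""
    def __init__(self):
        super().__init__()
        # Video branch: (B, 3, T, H, W) clip -> global feature vector.
        self.video = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Audio branch: (B, 1, samples) waveform -> global feature vector.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, 1)  # logit: aligned (1) vs. shifted (0)

    def forward(self, frames, waveform):
        v = self.video(frames).flatten(1)    # (B, 16)
        a = self.audio(waveform).flatten(1)  # (B, 16)
        return self.head(torch.cat([v, a], dim=1)).squeeze(1)

def training_step(net, frames, waveform, shift):
    """One self-supervised step: half the batch keeps its true audio,
    the other half gets audio rolled in time; the network must tell
    the two cases apart, so the labels are free."""
    b = frames.size(0)
    labels = torch.zeros(b)
    labels[: b // 2] = 1.0                              # first half: aligned
    shifted = torch.roll(waveform, shifts=shift, dims=-1)
    mixed = torch.where(labels.view(b, 1, 1).bool(), waveform, shifted)
    logits = net(frames, mixed)
    return F.binary_cross_entropy_with_logits(logits, labels)

if __name__ == "__main__":
    net = AlignmentNet()
    frames = torch.randn(4, 3, 8, 32, 32)   # (B, C, T, H, W) video clips
    waveform = torch.randn(4, 1, 1600)      # (B, 1, samples) audio
    loss = training_step(net, frames, waveform, shift=400)
    loss.backward()
    print(float(loss))
```

Because the labels come from the time shift itself, no human annotation is needed; the encoder weights learned this way are the kind of representation the paper then reuses for sound source localization, action recognition, and on/off-screen source separation.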
