Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets.Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.

[1]  Andrew Zisserman,et al.  Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[2]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[3]  Shmuel Peleg,et al.  Seeing Through Noise: Visually Driven Speaker Separation And Enhancement , 2017, ICASSP.

[4]  Joon Son Chung,et al.  LRS3-TED: a large-scale dataset for visual speech recognition , 2018, ArXiv.

[5]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Kristen Grauman,et al.  Co-Separating Sounds of Visual Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Joon Son Chung,et al.  Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Kevin Wilson,et al.  Looking to listen at the cocktail party , 2018, ACM Trans. Graph..

[9]  Mubarak Shah,et al.  Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects , 2013, IEEE Transactions on Multimedia.

[10]  Joon Son Chung,et al.  Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Andrew Zisserman,et al.  Objects that Sound , 2017, ECCV.

[12]  Radu Horaud,et al.  Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Michael Elad,et al.  Pixels that sound , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  Yoav Y. Schechner,et al.  Harmony in Motion , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Vittorio Murino,et al.  Voice Activity Detection by Upper Body Motion Analysis and Unsupervised Domain Adaptation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[16]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[17]  Andrew Owens,et al.  Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning , 2017, International Journal of Computer Vision.

[18]  Chuang Gan,et al.  The Sound of Pixels , 2018, ECCV.

[19]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[20]  Peter V. Gehler,et al.  Semantic Video CNNs Through Representation Warping , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Tinne Tuytelaars,et al.  Cross-Modal Supervision for Learning Active Speaker Detection in Video , 2016, ECCV.

[23]  Larry S. Davis,et al.  Look who's talking: speaker detection using video and audio correlation , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[24]  Tae-Hyun Oh,et al.  Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Chuang Gan,et al.  Self-Supervised Moving Vehicle Tracking With Stereo Sound , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Andrew Zisserman,et al.  Memory-augmented Dense Predictive Coding for Video Representation Learning , 2020, ECCV.

[28]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[29]  Trevor Darrell,et al.  Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.

[30]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[31]  Joon Son Chung,et al.  Disentangled Speech Embeddings Using Cross-Modal Self-Supervision , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Stefanos Zafeiriou,et al.  RetinaFace: Single-stage Dense Face Localisation in the Wild , 2019, ArXiv.

[33]  Abhishek Dutta,et al.  The VIA Annotation Software for Images, Audio and Video , 2019, ACM Multimedia.

[34]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[35]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[36]  Joon Son Chung,et al.  Signs in time: Encoding human motion as a temporal image , 2016, ArXiv.

[37]  Naji Khosravan,et al.  On Attention Modules for Audio-Visual Synchronization , 2018, CVPR Workshops.

[38]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[39]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[41]  Kristen Grauman,et al.  2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Irene Kotsia,et al.  RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Andrew Owens,et al.  Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[44]  B. Holden Listen and learn , 2002 .

[45]  Joon Son Chung,et al.  Who said that?: Audio-visual speaker diarisation of real-world meetings , 2019, INTERSPEECH.

[46]  Javier R. Movellan,et al.  Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.

[47]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Lorenzo Torresani,et al.  Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization , 2018, ArXiv.

[49]  Chuang Gan,et al.  The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[50]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[51]  Joon Son Chung,et al.  My lips are concealed: Audio-visual speech enhancement through obstructions , 2019, INTERSPEECH.

[52]  Arkadiusz Stopczynski,et al.  Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[53]  Chuang Gan,et al.  Self-supervised Audio-visual Co-segmentation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[54]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Rémi Gribonval,et al.  BSS_EVAL Toolbox User Guide -- Revision 2.0 , 2005 .

[56]  Cordelia Schmid,et al.  Supplementary Material: AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[57]  Joon Son Chung,et al.  The Conversation: Deep Audio-Visual Speech Enhancement , 2018, INTERSPEECH.

[58]  James R. Glass,et al.  Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.

[59]  Sukhendu Das,et al.  See the Sound, Hear the Pixels , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[60]  Xuelong Li,et al.  Deep Multimodal Clustering for Unsupervised Audiovisual Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Rogério Schmidt Feris,et al.  Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.

[62]  Chenliang Xu,et al.  Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.

[63]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.