Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each modality. The effectiveness of the method is demonstrated on two downstream tasks: lip reading, using features trained for audio-visual synchronisation, and speaker recognition, using features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
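The training strategy described above combines two objectives: a cross-modal matching term (each audio segment must be matched to its synchronous video segment among the candidates in a batch) and a term that separates the features of different samples within each modality. The sketch below is a hypothetical illustration of such a combined loss, not the paper's exact formulation; the function name, the squared-similarity intra-modal penalty, and the weighting `alpha` are all assumptions chosen for clarity.

```python
import numpy as np

def cross_modal_loss(audio_emb, video_emb, tau=0.1, alpha=1.0):
    """Illustrative sketch (assumed form, not the paper's exact loss):
    cross-modal matching + intra-modal feature separation."""
    # L2-normalise the embeddings of each modality
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    n = a.shape[0]

    # Cross-modal term: softmax cross-entropy over batch similarities,
    # where the i-th audio should match the i-th video (and vice versa).
    logits = (a @ v.T) / tau                     # (n, n) similarity matrix

    def xent(m):
        m = m - m.max(axis=1, keepdims=True)     # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diagonal(logp))       # diagonal = correct pairs

    loss_cross = xent(logits) + xent(logits.T)

    # Intra-modal term: penalise residual similarity between *different*
    # samples within the same modality, pushing their features apart.
    def intra(x):
        s = x @ x.T
        off = s - np.diag(np.diag(s))            # zero out self-similarity
        return np.sum(np.maximum(off, 0.0) ** 2) / (n * (n - 1))

    loss_intra = intra(a) + intra(v)
    return loss_cross + alpha * loss_intra
```

The key design point is the second term: a purely cross-modal objective can leave embeddings within one modality poorly separated, which hurts uni-modal downstream tasks such as lip reading or speaker recognition; the intra-modal penalty is one simple way to enforce that separation.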
