Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart—without annotation—the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads ‘in the wild’, and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
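
To make the two-stream idea concrete, below is a minimal PyTorch sketch of an architecture with a shared low-level audio trunk feeding two heads—a per-frame content embedding and a time-pooled identity embedding—trained against a face stream with cross-modal contrastive objectives. All layer sizes, the placeholder face features, and the exact contrastive formulations are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamDisentangler(nn.Module):
    """Sketch: shared audio trunk with separate content and identity heads.

    Hypothetical layer sizes; the face stream is stubbed out as linear
    projections of precomputed per-frame face features.
    """
    def __init__(self, emb_dim=256, face_dim=512):
        super().__init__()
        # Shared low-level trunk over log-mel spectrograms:
        # (B, 1, n_mels, T) -> (B, 128, T)
        self.audio_trunk = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse frequency axis
        )
        # Two heads branch off the shared features:
        self.content_head = nn.Conv1d(128, emb_dim, 1)   # per-frame linguistic content
        self.identity_head = nn.Linear(128, emb_dim)     # one vector per utterance
        # Face stream (assumed precomputed per-frame face features)
        self.face_content = nn.Linear(face_dim, emb_dim)
        self.face_identity = nn.Linear(face_dim, emb_dim)

    def forward(self, spec, face_feats):
        h = self.audio_trunk(spec).squeeze(2)            # (B, 128, T)
        content = self.content_head(h)                   # (B, D, T)
        identity = self.identity_head(h.mean(dim=2))     # (B, D), pooled over time
        fc = self.face_content(face_feats)               # (B, T, D)
        fi = self.face_identity(face_feats.mean(dim=1))  # (B, D)
        return content, identity, fc, fi

def sync_content_loss(content, fc):
    # Content embeddings must align temporally with the face stream:
    # each audio frame should score highest against the face frame at
    # the same time step, with shifted frames as negatives.
    a = F.normalize(content.transpose(1, 2), dim=-1)     # (B, T, D)
    v = F.normalize(fc, dim=-1)                          # (B, T, D)
    sim = torch.einsum('btd,bsd->bts', a, v)             # (B, T, T)
    target = torch.arange(sim.size(1), device=sim.device)
    return F.cross_entropy(sim.flatten(0, 1), target.repeat(sim.size(0)))

def identity_loss(identity, fi):
    # Identity embeddings must match the face from the same video,
    # with other videos in the batch as negatives; time pooling
    # encourages this head to discard frame-level content.
    a = F.normalize(identity, dim=-1)
    v = F.normalize(fi, dim=-1)
    sim = a @ v.t()                                      # (B, B)
    target = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, target)

# Usage with dummy inputs (4 clips, 40 mel bins, 100 frames):
model = TwoStreamDisentangler()
spec = torch.randn(4, 1, 40, 100)
face = torch.randn(4, 100, 512)
c, i, fc, fi = model(spec, face)
loss = sync_content_loss(c, fc) + identity_loss(i, fi)
```

The key design point captured by the sketch is that both heads read the same shared features, but their supervision differs: the content head is trained on fine-grained temporal synchrony while the identity head is trained on clip-level matching, which is one plausible way the two factors can be explicitly separated without any manual labels.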
