Speech2Face: Learning the Face Behind a Voice

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images capturing various physical attributes of the speakers, such as age, gender, and ethnicity. This is done in a self-supervised manner, by exploiting the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
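
To make the self-supervised setup concrete, below is a minimal PyTorch sketch of the training idea described in the abstract: a voice encoder ingests a speech spectrogram and is trained to regress the feature vector that a pretrained face-recognition network extracts from the co-occurring face frame, so the face image itself provides the supervision signal with no explicit attribute labels. This is an illustrative sketch, not the paper's exact architecture; the layer sizes, the 4096-dimensional feature target, the L1 loss, and the names VoiceEncoder and training_step are assumptions made for this example.

```python
# Illustrative sketch of self-supervised voice-to-face-feature training.
# Assumptions (not from the paper's abstract): PyTorch, a pretrained
# face-recognition network `face_net` that outputs a 4096-D feature vector,
# a simple convolutional voice encoder, and an L1 regression loss.
import torch
import torch.nn as nn


class VoiceEncoder(nn.Module):
    """Maps a speech spectrogram to a face-feature vector (hypothetical layout)."""

    def __init__(self, feat_dim: int = 4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames)
        h = self.conv(spec).flatten(1)
        return self.fc(h)


def training_step(voice_encoder, face_net, spec, face_img, optimizer):
    """One self-supervised step: the co-occurring face frame supplies the target."""
    with torch.no_grad():
        target = face_net(face_img)   # face features of the true speaker (fixed net)
    pred = voice_encoder(spec)        # features predicted from the voice alone
    loss = nn.functional.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full system along these lines, a separately trained face decoder would map the predicted feature vector back to a canonical face image; the sketch covers only the encoder side, which is where the voice-face correlations are learned.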
