Paper Information - Speech2Face: Learning the Face Behind a Voice

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers, such as age, gender, and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
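The self-supervised idea in the abstract, where the face that co-occurs with a voice in the same video serves as the training target, can be illustrated with a toy sketch. This is not the paper's network: the data are synthetic, the model is a plain linear map fitted by gradient descent, and every name here (`make_paired_data`, `train_voice_to_face`, the feature dimensions) is a hypothetical chosen for illustration. The only point it is meant to show is that no age, gender, or ethnicity labels are needed, because the paired face feature itself acts as the supervision signal.

```python
import numpy as np

def make_paired_data(n_clips, voice_dim, face_dim, seed=0):
    """Synthetic stand-in for (voice feature, face feature) pairs from the same clip."""
    rng = np.random.default_rng(seed)
    # Hidden voice-face correlation; exists only to generate the toy data.
    true_map = rng.normal(size=(voice_dim, face_dim))
    voice = rng.normal(size=(n_clips, voice_dim))
    face = voice @ true_map + 0.1 * rng.normal(size=(n_clips, face_dim))
    return voice, face

def train_voice_to_face(voice, face, lr=0.05, steps=500):
    """Fit W so that voice @ W approximates the co-occurring face features."""
    n_clips, voice_dim = voice.shape
    W = np.zeros((voice_dim, face.shape[1]))
    for _ in range(steps):
        residual = voice @ W - face        # error against the paired face feature
        grad = voice.T @ residual / n_clips  # squared-error gradient, averaged over clips
        W -= lr * grad
    return W

def mse(voice, face, W):
    """Mean squared error between predicted and paired face features."""
    return float(np.mean((voice @ W - face) ** 2))

# No attribute labels appear anywhere: the pairing alone supervises the fit.
voice, face = make_paired_data(n_clips=512, voice_dim=16, face_dim=8)
W = train_voice_to_face(voice, face)
initial_loss = mse(voice, face, np.zeros_like(W))
final_loss = mse(voice, face, W)
print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In a real system of this kind, the two feature arrays would presumably come from learned audio and face encoders rather than random draws, and a separate decoder would be needed to turn predicted face features into an image; both are outside the scope of this sketch.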
