On Learning Associations of Faces and Voices

In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience, and the overlapping information between the two modalities has been shown to play a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance comparable to that of humans. Our representation also exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release our dataset of audiovisual recordings, with demographic annotations, of people reading out short passages of text, as used in our studies.
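The abstract does not specify how the cross-modal representation is learned; a common approach for this kind of matching task is to project face and voice features into a shared embedding space and train with a metric-learning objective such as a triplet loss. The sketch below is illustrative only: the feature dimensions, random-projection "encoders", and margin value are all assumptions, standing in for whatever learned networks the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not specified in the abstract).
FACE_DIM, VOICE_DIM, EMBED_DIM = 128, 64, 32

# Fixed random projections stand in for learned face/voice encoders.
W_face = rng.standard_normal((FACE_DIM, EMBED_DIM)) * 0.1
W_voice = rng.standard_normal((VOICE_DIM, EMBED_DIM)) * 0.1

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: pull the matching cross-modal pair
    together while pushing the mismatched pair at least `margin` apart."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# One face, its matching voice, and a distractor voice (random stand-ins).
face = rng.standard_normal(FACE_DIM)
voice_match = rng.standard_normal(VOICE_DIM)
voice_other = rng.standard_normal(VOICE_DIM)

f = embed(face, W_face)
v_pos = embed(voice_match, W_voice)
v_neg = embed(voice_other, W_voice)

loss = triplet_loss(f, v_pos, v_neg)
```

In a trained system, minimizing this loss over many (face, matching voice, distractor voice) triplets is what would make nearest-neighbor lookup in the shared space identify the matching voice for an unseen face, mirroring the human matching task described above.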
