Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Unlike existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, it learns a shared representation by mapping each modality individually to the covariates they have in common. These shared representations can then be used to find correspondences between the modalities. We show empirically that DIMNet achieves better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
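
The idea of mapping each modality separately to a common covariate can be illustrated with a small sketch. The code below is not the DIMNet architecture or training recipe; it is a minimal NumPy toy under several assumptions made here for illustration: linear encoders stand in for the voice and face networks, the only covariate is a single categorical label, the data are synthetic, and the names `DisjointMapper`, `make_toy_data`, and `match` are hypothetical. What it shows is that voices and faces are never paired during training: each batch contains one modality only, and the two encoders are tied together solely through the shared covariate classifier.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


class DisjointMapper:
    """Toy model: one linear encoder per modality, one shared classifier.

    Assumption: linear maps replace the deep encoders a real system would use.
    """

    def __init__(self, voice_dim, face_dim, embed_dim, n_classes):
        self.W_voice = rng.normal(0, 0.1, (voice_dim, embed_dim))
        self.W_face = rng.normal(0, 0.1, (face_dim, embed_dim))
        self.W_cls = rng.normal(0, 0.1, (embed_dim, n_classes))  # shared

    def _encoder(self, modality):
        return self.W_voice if modality == "voice" else self.W_face

    def embed(self, x, modality):
        return x @ self._encoder(modality)

    def predict(self, x, modality):
        return softmax(self.embed(x, modality) @ self.W_cls)

    def train_step(self, x, labels, modality, lr=0.1):
        """One gradient step of cross-entropy on a single-modality batch."""
        W = self._encoder(modality)
        n = len(labels)
        emb = x @ W
        probs = softmax(emb @ self.W_cls)
        loss = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
        grad_logits = probs.copy()
        grad_logits[np.arange(n), labels] -= 1.0
        grad_logits /= n
        grad_cls = emb.T @ grad_logits
        grad_W = x.T @ (grad_logits @ self.W_cls.T)
        self.W_cls -= lr * grad_cls
        W -= lr * grad_W  # in-place, so the stored encoder is updated
        return loss


def make_toy_data(n_per_class, protos, noise=0.3):
    """Synthetic features: a per-class prototype plus Gaussian noise."""
    labels = np.repeat(np.arange(len(protos)), n_per_class)
    x = protos[labels] + noise * rng.normal(size=(len(labels), protos.shape[1]))
    return x, labels


def match(model, voice, gallery_faces):
    """Return the index of the gallery face closest to the voice (cosine)."""
    v = model.embed(voice[None, :], "voice")
    f = model.embed(gallery_faces, "face")
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return int(np.argmax(f @ v.T))


# Toy setup: 4 covariate classes, 25 samples per class in each modality.
n_classes, voice_dim, face_dim = 4, 20, 30
voices, voice_labels = make_toy_data(25, rng.normal(size=(n_classes, voice_dim)))
faces, face_labels = make_toy_data(25, rng.normal(size=(n_classes, face_dim)))

model = DisjointMapper(voice_dim, face_dim, embed_dim=8, n_classes=n_classes)
for _ in range(200):
    # The modalities are trained in separate batches; no voice-face pairs.
    model.train_step(voices, voice_labels, "voice")
    model.train_step(faces, face_labels, "face")

probs = model.predict(voices, "voice")
gallery = faces[[0, 25, 50, 75]]  # one face from each class
idx = match(model, voices[0], gallery)
print("matched gallery index:", idx)
```

Matching is done in the shared embedding space by cosine similarity, which is one simple choice for comparing the two encoders' outputs; other similarity measures could be substituted without changing the structure of the sketch.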
