Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network

Audio-visual speaker recognition was one of the tasks in the recent 2019 NIST speaker recognition evaluation (SRE). Studies in both neuroscience and computer science point to the fact that visual and auditory neural signals interact during cognitive processing. This motivated us to study a cross-modal network, the voice-face discriminative network (VFNet), which establishes a general relation between human voices and faces. Experiments show that VFNet provides additional speaker-discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
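The score-level fusion baseline referenced above combines per-trial scores from the audio and visual subsystems; VFNet contributes a third, cross-modal score. A minimal sketch of such a fusion is shown below. The function name and the weights are illustrative assumptions, not taken from the paper; in practice the weights would be calibrated on a development set (e.g. via logistic regression).

```python
def fuse_scores(audio_score: float, face_score: float, vfnet_score: float,
                weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted score-level fusion of three subsystem scores for one trial.

    The default weights are placeholders for illustration only; the paper
    does not report its fusion weights, which are normally learned by
    calibrating on held-out development trials.
    """
    w_audio, w_face, w_vfnet = weights
    return w_audio * audio_score + w_face * face_score + w_vfnet * vfnet_score


# Example: a trial where all three subsystems agree keeps the same score scale.
fused = fuse_scores(1.0, 1.0, 1.0)
```

Because the weights sum to one, fusion here is a convex combination, so a trial scored identically by all subsystems retains that score after fusion.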
