HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

This work describes the speaker verification system developed by the Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for the 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). Multimedia research has gained attention across a wide range of applications, and speaker recognition is no exception. In contrast to previous NIST SREs, the latest edition focuses on a multimedia track in which speakers are recognized using both audio and visual information. We developed separate systems for the audio and visual inputs, followed by score-level fusion of the systems from the two modalities so that their information is used jointly. The audio systems are based on x-vector speaker embeddings, whereas the face recognition systems are based on ResNet and InsightFace face embeddings. With post-evaluation studies and refinements, we obtain an equal error rate (EER) of 0.88% and an actual detection cost function (actDCF) of 0.026 on the evaluation set of the 2019 NIST multimedia SRE corpus.
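
As an illustration of score-level fusion and of how an EER is measured, the following is a minimal sketch rather than the submitted system. It assumes one audio score and one visual score per trial, normalizes each modality to zero mean and unit variance, and combines them with a fixed weight. The function names, the normalization, and the `audio_weight` parameter are hypothetical choices made for this example; the abstract does not state how the fusion was configured.

```python
import numpy as np


def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted sum of per-trial scores after z-normalizing each modality.

    Assumption: both inputs hold one score per trial, in the same trial
    order, and neither modality has constant scores (non-zero std).
    """
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(visual_scores, dtype=float)
    if a.shape != v.shape:
        raise ValueError("audio and visual scores must cover the same trials")
    # Put both modalities on a comparable scale before mixing them.
    a = (a - a.mean()) / a.std()
    v = (v - v.mean()) / v.std()
    return audio_weight * a + (1.0 - audio_weight) * v


def equal_error_rate(scores, labels):
    """Approximate EER: the point where miss and false-alarm rates meet.

    labels: 1 (or True) for target trials, 0 (or False) for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    # A trial is accepted when its score is >= the threshold.
    miss = np.array([(scores[labels] < t).mean() for t in thresholds])
    false_alarm = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2.0


if __name__ == "__main__":
    # Toy trials, invented purely for illustration.
    labels = [1, 1, 0, 0]
    audio = [2.0, 0.1, 0.3, -1.0]
    visual = [0.9, 0.8, 0.2, 0.3]
    fused = fuse_scores(audio, visual, audio_weight=0.5)
    print("audio EER:", equal_error_rate(audio, labels))
    print("fused EER:", equal_error_rate(fused, labels))
```

In practice the fusion weight would be tuned on a development set, and a calibrated fusion (for example, logistic regression over the two scores) is a common alternative to a fixed weighted sum.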
