Audio-Visual Deep Neural Network for Robust Person Verification

Voice and face are two of the most popular biometrics for person verification, typically used in speaker verification and face verification tasks, respectively. It has already been observed that simply combining the information from these two modalities can lead to a more powerful and robust person verification system. In this article, to fully explore multi-modal learning strategies for person verification, we propose three types of audio-visual deep neural network (AVN): feature-level AVN (AVN-F), embedding-level AVN (AVN-E), and embedding-level combination with joint learning AVN (AVN-J). To further enhance system robustness in real noisy conditions, where high-quality input is not always available for both modalities, we propose data augmentation strategies for each AVN: a feature-level multi-modal data augmentation for AVN-F and an embedding-level data augmentation with novel noise distribution matching for AVN-E. For AVN-J, both the feature-level and embedding-level multi-modal data augmentation methods can be applied. All the proposed models are trained on the VoxCeleb2 dev dataset and evaluated on the standard VoxCeleb1 dataset. The best system achieves 0.558%, 0.441% and 0.793% EER on the three official trial lists of VoxCeleb1, which is, to our knowledge, the best published single-system result on this corpus for person verification. To validate the robustness of the proposed approaches, a noisy evaluation set based on VoxCeleb1 is constructed, and experimental results show that the proposed system significantly improves robustness and still shows promising performance under this noisy scenario.
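The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how such a figure can be computed from a trial list, the following function sweeps a decision threshold over the scores. The function name `equal_error_rate`, the score convention (higher means more likely the same person), and the handling of ties are assumptions made for illustration, not the scoring tool used in this work.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from verification scores.

    scores: similarity scores, higher = more likely the same person (assumed).
    labels: 1 for a target (same-person) trial, 0 for a non-target trial.
    Tied scores are not treated specially in this sketch.
    """
    # Sort trials from highest to lowest score; lowering the threshold
    # past each trial accepts one more trial.
    trials = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_target = sum(1 for _, y in trials if y == 1)
    n_nontarget = len(trials) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    true_accepts = 0
    false_accepts = 0
    # Start from the threshold that rejects everything: FAR = 0, FRR = 1.
    best_gap, eer = 1.0, 0.5
    for _, y in trials:
        if y == 1:
            true_accepts += 1
        else:
            false_accepts += 1
        far = false_accepts / n_nontarget      # false acceptance rate
        frr = 1.0 - true_accepts / n_target    # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With perfectly separated scores, for example `equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])`, the sketch returns 0.0. On a finite trial list the two error rates rarely cross exactly, so the value is the midpoint at the closest threshold.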
