Deep multi-metric learning for text-independent speaker verification

Abstract Text-independent speaker verification is an important artificial intelligence problem with a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The goal of text-independent speaker verification is to determine whether two given utterances of unconstrained content originate from the same speaker. Extracting speech features for each speaker with deep neural networks is a promising direction, and a straightforward solution is to train a discriminative feature-extraction network with a metric-learning loss function. However, any single loss function has its own limitations; for example, the triplet loss compares each anchor with only one negative sample at a time. We therefore address the problem with deep multi-metric learning and introduce three losses: the triplet loss, the n-pair loss, and the angular loss. The three loss functions work cooperatively to train a feature-extraction network equipped with residual connections and squeeze-and-excitation attention. We conduct experiments on the large-scale VoxCeleb2 dataset, which contains over a million utterances from more than 6,000 speakers, and the proposed deep neural network obtains a highly competitive equal error rate of 3.48%. Code for both training and testing, together with pretrained models, is available at https://github.com/GreatJiweix/DmmlTiSV, which is the first publicly available code repository for large-scale text-independent speaker verification with performance on par with state-of-the-art systems.
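The cooperative use of the three metric-learning losses can be illustrated with a short PyTorch sketch. The function names, the margin, the angle bound, and the loss weights below (triplet_loss, n_pair_loss, angular_loss, multi_metric_loss, w_triplet, w_npair, w_angular) are illustrative assumptions, not the exact implementation in the linked repository; the sketch assumes L2-normalized speaker embeddings arranged as matched (anchor, positive, negative) mini-batches, one triplet per speaker.

```python
import math
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared Euclidean distances of normalized embeddings."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()


def n_pair_loss(anchor, positive):
    """Simplified n-pair loss: each anchor is pulled toward its own positive
    and pushed away from every other speaker's positive in the batch,
    via a softmax over the (N, N) similarity matrix."""
    logits = anchor @ positive.t()
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    """Hinge form of the angular loss: the angle at the negative vertex of the
    (anchor, positive, negative) triangle is constrained to be below alpha."""
    tan2 = math.tan(math.radians(alpha_deg)) ** 2
    center = (anchor + positive) / 2          # midpoint of the positive pair
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_nc = (negative - center).pow(2).sum(dim=1)
    return F.relu(d_ap - 4.0 * tan2 * d_nc).mean()


def multi_metric_loss(anchor, positive, negative,
                      w_triplet=1.0, w_npair=1.0, w_angular=1.0):
    """Weighted sum of the three losses; the weights are hypothetical."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    return (w_triplet * triplet_loss(anchor, positive, negative)
            + w_npair * n_pair_loss(anchor, positive)
            + w_angular * angular_loss(anchor, positive, negative))


if __name__ == "__main__":
    # Stand-in for embeddings produced by the SE-ResNet encoder:
    # a batch of 32 speakers with 512-dimensional embeddings.
    a, p, n = (torch.randn(32, 512) for _ in range(3))
    print(float(multi_metric_loss(a, p, n)))
```

In this sketch the three terms are simply summed with fixed weights; in practice the relative weighting would be tuned on a held-out set, since the triplet and angular terms operate on distances while the n-pair term is a cross-entropy over similarities.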
