Deep multi-metric learning for text-independent speaker verification

Abstract Text-independent speaker verification is an important artificial intelligence problem with a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The goal of text-independent speaker verification is to determine whether two given utterances of unconstrained content originate from the same speaker. Extracting speech features for each speaker with deep neural networks is a promising direction, and a straightforward solution is to train a discriminative feature-extraction network with a metric-learning loss function. However, any single loss function has its own limitations; for example, the triplet loss compares each anchor with only one negative sample at a time. We therefore address the problem with deep multi-metric learning and introduce three losses: the triplet loss, the n-pair loss, and the angular loss. The three loss functions work cooperatively to train a feature-extraction network equipped with residual connections and squeeze-and-excitation attention. We conduct experiments on the large-scale VoxCeleb2 dataset, which contains over a million utterances from more than 6,000 speakers, and the proposed deep neural network obtains a highly competitive equal error rate of 3.48%. Code for both training and testing, together with pretrained models, is available at https://github.com/GreatJiweix/DmmlTiSV, which is the first publicly available code repository for large-scale text-independent speaker verification with performance on par with state-of-the-art systems.
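The cooperative use of the three metric-learning losses can be illustrated with a short PyTorch sketch. The function names, the margin, the angle bound, and the loss weights below (triplet_loss, n_pair_loss, angular_loss, multi_metric_loss, w_triplet, w_npair, w_angular) are illustrative assumptions, not the exact implementation in the linked repository; the sketch assumes L2-normalized speaker embeddings arranged as matched (anchor, positive, negative) mini-batches, one triplet per speaker.

```python
import math
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared Euclidean distances of normalized embeddings."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()


def n_pair_loss(anchor, positive):
    """Simplified n-pair loss: each anchor is pulled toward its own positive
    and pushed away from every other speaker's positive in the batch,
    via a softmax over the (N, N) similarity matrix."""
    logits = anchor @ positive.t()
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    """Hinge form of the angular loss: the angle at the negative vertex of the
    (anchor, positive, negative) triangle is constrained to be below alpha."""
    tan2 = math.tan(math.radians(alpha_deg)) ** 2
    center = (anchor + positive) / 2          # midpoint of the positive pair
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_nc = (negative - center).pow(2).sum(dim=1)
    return F.relu(d_ap - 4.0 * tan2 * d_nc).mean()


def multi_metric_loss(anchor, positive, negative,
                      w_triplet=1.0, w_npair=1.0, w_angular=1.0):
    """Weighted sum of the three losses; the weights are hypothetical."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    return (w_triplet * triplet_loss(anchor, positive, negative)
            + w_npair * n_pair_loss(anchor, positive)
            + w_angular * angular_loss(anchor, positive, negative))


if __name__ == "__main__":
    # Stand-in for embeddings produced by the SE-ResNet encoder:
    # a batch of 32 speakers with 512-dimensional embeddings.
    a, p, n = (torch.randn(32, 512) for _ in range(3))
    print(float(multi_metric_loss(a, p, n)))
```

In this sketch the three terms are simply summed with fixed weights; in practice the relative weighting would be tuned on a held-out set, since the triplet and angular terms operate on distances while the n-pair term is a cross-entropy over similarities.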
