End-to-End Residual CNN with L-GM Loss Speaker Verification System

We propose an end-to-end speaker verification system based on a residual convolutional neural network (ResNet) and trained with a loss function of lower computational complexity. The system extracts frame-level features from each utterance with the ResNet, aggregates them into utterance-level speaker embeddings, and is trained with the large-margin Gaussian Mixture (L-GM) loss function. Owing to its large-margin constraint and likelihood regularization term, the L-GM loss benefits speaker verification performance. Experimental results demonstrate that the residual CNN trained with the L-GM loss outperforms a DNN-based i-vector baseline by more than 10% in accuracy.
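The training objective described above is the L-GM loss. The following is a minimal sketch of that loss in PyTorch, assuming an identity covariance for each class Gaussian (a common simplification); the class name LGMLoss and the hyper-parameter values alpha and lambda_ are illustrative choices, not settings reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGMLoss(nn.Module):
    """Sketch of the large-margin Gaussian Mixture (L-GM) loss with identity covariance."""
    def __init__(self, num_classes: int, embed_dim: int,
                 alpha: float = 0.1, lambda_: float = 0.01):
        super().__init__()
        # One learnable Gaussian mean per speaker class.
        self.means = nn.Parameter(0.1 * torch.randn(num_classes, embed_dim))
        self.alpha = alpha      # margin applied to the target-class distance
        self.lambda_ = lambda_  # weight of the likelihood regularization term

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Squared distance to each class mean (identity covariance): ||x - mu_k||^2 / 2
        diff = embeddings.unsqueeze(1) - self.means.unsqueeze(0)   # (B, C, D)
        dist = 0.5 * diff.pow(2).sum(dim=2)                        # (B, C)

        # Enlarge the target-class distance by (1 + alpha) to enforce the margin.
        one_hot = F.one_hot(labels, num_classes=dist.size(1)).float()
        margin_dist = dist * (1.0 + self.alpha * one_hot)

        # Classification term: softmax cross-entropy over negative distances.
        cls_loss = F.cross_entropy(-margin_dist, labels)

        # Likelihood regularization: pulls each embedding toward its class mean.
        lkd_loss = (dist * one_hot).sum(dim=1).mean()

        return cls_loss + self.lambda_ * lkd_loss


# Usage sketch: speaker embeddings from the ResNet and integer speaker labels.
# criterion = LGMLoss(num_classes=1251, embed_dim=512)
# loss = criterion(utterance_embeddings, speaker_ids)
```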
