论文信息 - Robust Speaker Recognition Based on Stacked Auto-encoders

Robust Speaker Recognition Based on Stacked Auto-encoders

Speaker recognition is a biometric modality which utilize speaker’s speech segments to recognize identity, determining whether the test speaker belongs to one of the enrolled speakers. In order to improve the robustness of i-vector framework on cross-channel conditions and explore the nova method for applying deep learning to speaker recognition, the Stacked Auto-encoders is applied to get the abstract extraction of the i-vector instead of applying PLDA. After pre-processing and feature extraction, the speaker and channel independent speeches are employed for UBM training. The UBM is then used to extract the i-vector of the enrollment and test speech. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimension and increase the discrimination between speaker subspaces, this research use stacked auto-encoders to reconstruct the i-vector with lower dimension and different classifiers can be chosen to achieve final classification. The experimental results show that the proposed method achieves better performance than the-state-of-the-art method.

[1] Patrick Kenny,et al. Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms , 2006 .

[2] Li-Rong Dai,et al. Improvements on Deep Bottleneck Network based I-Vector Representation for Spoken Language Identification , 2016, Odyssey.

[3] Li-Rong Dai,et al. Deep bottleneck network based i-vector representation for language identification , 2015, INTERSPEECH.

[4] Björn W. Schuller,et al. Modeling gender information for emotion recognition using Denoising autoencoder , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.

[6] Douglas E. Sturim,et al. Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[7] B. Atal,et al. Speech analysis and synthesis by linear prediction of the speech wave. , 1971, The Journal of the Acoustical Society of America.

[8] Jean-Luc Gauvain,et al. A phone-based approach to non-linguistic speech feature identification , 1995, Comput. Speech Lang..

[9] William M. Campbell,et al. Generalized linear discriminant sequence kernels for speaker recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10] Sébastien Marcel,et al. Audio-visual gender recognition in uncontrolled environment using variability modeling techniques , 2014, IEEE International Joint Conference on Biometrics.

[11] Patrick Kenny,et al. Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[12] David A. van Leeuwen,et al. Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Douglas A. Reynolds,et al. Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[14] James H. Elder,et al. Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[15] Longbiao Wang,et al. Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification , 2015, EURASIP J. Audio Speech Music. Process..

[16] Elmar Nöth,et al. Age and gender recognition for telephone applications based on GMM supervectors and support vector machines , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17] Sri Harish Reddy Mallidi,et al. Neural Network Bottleneck Features for Language Identification , 2014, Odyssey.

[18] Stan Davis,et al. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .