Robust Speaker Recognition Based on Stacked Auto-encoders

Speaker recognition is a biometric modality that uses a speaker's speech segments to recognize identity, determining whether a test speaker belongs to the set of enrolled speakers. To improve the robustness of the i-vector framework under cross-channel conditions and to explore a novel way of applying deep learning to speaker recognition, stacked auto-encoders are used to obtain an abstract representation of the i-vector instead of applying PLDA. After pre-processing and feature extraction, speaker- and channel-independent speech is used to train the universal background model (UBM). The UBM is then used to extract the i-vectors of the enrollment and test utterances. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimensionality and increase the discrimination between speaker subspaces, this work uses stacked auto-encoders to reconstruct the i-vector at a lower dimension, after which different classifiers can be chosen for the final classification. Experimental results show that the proposed method achieves better performance than the state-of-the-art method.
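The core idea — greedily training auto-encoder layers and using the stacked encoders to map i-vectors to a lower-dimensional representation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses tied-weight tanh auto-encoders trained by plain gradient descent, and random vectors stand in for real i-vectors (which would come from the UBM/total-variability model); the dimensions 40 → 20 → 10 are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden_dim, lr=0.05, epochs=200):
    """Train one tied-weight auto-encoder on X (n_samples, in_dim).

    Encoder: H = tanh(X W + b); decoder: X_rec = H W^T + c (linear output).
    Returns the encoder parameters (W, b).
    """
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (d, hidden_dim))
    b = np.zeros(hidden_dim)
    c = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W + b)        # encode
        X_rec = H @ W.T + c           # decode with tied weights
        err = X_rec - X               # reconstruction error
        dH = (err @ W) * (1.0 - H**2) # backprop through tanh
        gW = X.T @ dH + err.T @ H     # tied weights: encoder + decoder grads
        W -= lr * gW / n
        b -= lr * dH.sum(axis=0) / n
        c -= lr * err.sum(axis=0) / n
    return W, b

def encode(X, layers):
    """Push data through the stack of trained encoders."""
    for W, b in layers:
        X = np.tanh(X @ W + b)
    return X

# Synthetic stand-in for i-vectors (200 utterances, 40 dimensions).
X = rng.normal(size=(200, 40))

# Greedy layer-wise training of the stack: 40 -> 20 -> 10.
layers, H = [], X
for hidden in (20, 10):
    W, b = train_autoencoder(H, hidden)
    layers.append((W, b))
    H = np.tanh(H @ W + b)

Z = encode(X, layers)  # low-dimensional representation for a downstream classifier
print(Z.shape)         # (200, 10)
```

The reduced representation `Z` would then be fed to whichever back-end classifier is chosen (e.g. cosine scoring or an SVM), taking the place of the LDA-projected i-vectors in the traditional pipeline.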
