Stacked Autoencoder Networks Based Speaker Recognition

Speech signals carry a mixture of information, ranging from linguistic content to speaker-specific characteristics. This mixture, however, can limit the performance of speech and speaker recognition systems. Through unsupervised pre-training followed by supervised fine-tuning, a deep neural network built from autoencoders can effectively extract the critical information. In this paper, we propose a hybrid model that combines a stacked autoencoder with mel-frequency cepstral coefficients (MFCCs) to improve system performance. Experimental results on the TIMIT corpus show that combining the deep learning network with MFCC features improves speaker recognition accuracy to as much as 93.3%, with the gain most pronounced for female speakers.
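The unsupervised pre-training stage mentioned above can be illustrated with a minimal NumPy sketch of greedy layer-wise training for a stacked autoencoder. This is not the authors' implementation: the layer sizes, learning rate, sigmoid encoder with linear decoder, and the random matrix standing in for 13-dimensional MFCC frames are all assumptions chosen for illustration, and the supervised fine-tuning stage is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_autoencoder(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one autoencoder (sigmoid encoder, linear decoder) with
    full-batch gradient descent on the mean squared reconstruction error.
    Returns the encoder weights, encoder bias, and per-epoch losses."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)      # hidden code
        R = H @ W_dec + b_dec               # reconstruction
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        dR = 2.0 * err / err.size
        dW_dec = H.T @ dR
        db_dec = dR.sum(axis=0)
        dZ = (dR @ W_dec.T) * H * (1.0 - H)
        dW_enc = X.T @ dZ
        db_enc = dZ.sum(axis=0)
        W_enc -= lr * dW_enc
        b_enc -= lr * db_enc
        W_dec -= lr * dW_dec
        b_dec -= lr * db_dec
    return W_enc, b_enc, losses


def pretrain_stack(X, layer_sizes, epochs=200, lr=0.5):
    """Greedy layer-wise pre-training: each autoencoder is trained on the
    codes produced by the previously trained layer."""
    layers, all_losses = [], []
    layer_input = X
    for i, n_hidden in enumerate(layer_sizes):
        W, b, losses = train_autoencoder(layer_input, n_hidden, epochs, lr, seed=i)
        layers.append((W, b))
        all_losses.append(losses)
        layer_input = sigmoid(layer_input @ W + b)
    return layers, all_losses


def encode(X, layers):
    """Pass inputs through the stacked encoders to get the top-level code."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X


# Hypothetical stand-in for MFCC frames: 200 frames, 13 coefficients each.
frames = np.random.default_rng(42).normal(size=(200, 13))
layers, all_losses = pretrain_stack(frames, layer_sizes=[8, 4])
codes = encode(frames, layers)
```

In a full system, the pre-trained encoder weights would initialize a classifier that is then fine-tuned with speaker labels; how the learned codes are combined with the MFCC features is not specified in the abstract, so that step is left out here.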
