On the use of DNN Autoencoder for Robust Speaker Recognition

In this paper, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker recognition system. We started with augmenting the Fisher database with artificially noised and reverberated data and we trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a preprocessing step for a state-of-the-art text-independent speaker recognition system. We compare results achieved with pure autoencoder enhancement, multi-condition PLDA training and their simultaneous use. We present a detailed analysis with various conditions of NIST SRE 2010, PRISM and artificially corrupted NIST SRE 2010 telephone condition. We conclude that the proposed preprocessing significantly outperforms the baseline and that this technique can be used to build a robust speaker recognition system for reverberated and noisy data.

[1]  Jun Du,et al.  Deep neural network based speech separation for robust speech recognition , 2014, 2014 12th International Conference on Signal Processing (ICSP).

[2]  Bisrat Derebssa Dufera,et al.  Reverberated speech enhancement using neural networks , 2009, 2009 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS).

[3]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[4]  Speech Processing , Transmission and Quality Aspects ( STQ ) ; Test Methodologies for ETSI Test Events and Results ; Part 2 : 1 st ETSI Plugtests Speech Quality Test Event Report , 2022 .

[5]  Tatsuya Kawahara,et al.  Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature , 2015, EURASIP J. Adv. Signal Process..

[6]  Jun Du,et al.  Global variance equalization for improving deep neural network based speech enhancement , 2014, 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP).

[7]  Bhiksha Raj,et al.  Microphone array processing for distant speech recognition: Towards real-world deployment , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[8]  Ronaldus Maria Aarts,et al.  A Comparison of Some Loudness Measures for Loudspeaker Listening Tests , 1992 .

[9]  Hynek Hermansky,et al.  Developing a speaker identification system for the DARPA RATS project , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Yun Lei,et al.  Towards noise-robust speaker recognition using probabilistic linear discriminant analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Mark B. Sandler,et al.  Database of omnidirectional and B-format room impulse responses , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  Lukás Burget,et al.  Brno University of Technology System for NIST 2005 Language Recognition Evaluation , 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[14]  Yun Lei,et al.  Unscented transform for ivector-based noisy speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Sridha Sridharan,et al.  Feature warping for robust speaker verification , 2001, Odyssey.

[16]  Hagai Aronowitz,et al.  Audio enhancing with DNN autoencoder for speaker recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  L. Burget,et al.  Promoting robustness for speaker modeling in the community: the PRISM evaluation set , 2011 .

[18]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.