Paper Information - Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition

In this work, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker verification (SV) system. We start by carefully designing a data augmentation process that covers a wide range of acoustic conditions and yields rich training data for the various components of our SV system. We augment several well-known databases used in SV with artificially noised and reverberated data, and we use them to train a denoising autoencoder (mapping noisy and reverberated speech to its clean version) as well as an x-vector extractor, which is currently considered state-of-the-art in SV. We then use the autoencoder as a preprocessing step for a text-independent SV system. We compare results achieved with autoencoder enhancement, with multi-condition PLDA training, and with their simultaneous use. We present a detailed analysis on various conditions of NIST SRE 2010, NIST SRE 2016 and PRISM, and on re-transmitted data. We conclude that the proposed preprocessing can significantly improve both i-vector and x-vector baselines and that this technique can be used to build a robust SV system for various target domains.
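
The augmentation step described above (artificially adding reverberation and noise to clean speech to obtain input/target pairs for a denoising autoencoder) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the order of operations (reverberation first, then additive noise), the SNR value and the synthetic signals are all assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` so that the mixture has the requested SNR (dB)."""
    noise = np.resize(noise, signal.shape)  # repeat or truncate noise to match length
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that scales the noise power to signal_power / 10^(snr_db / 10).
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

def reverberate(signal, rir):
    """Convolve with a room impulse response, keeping the original length."""
    return np.convolve(signal, rir)[: len(signal)]

def make_training_pair(clean, noise, rir, snr_db):
    """Return (degraded input, clean target) for a denoising autoencoder.

    Assumption: reverberation is applied first, and the SNR is measured
    relative to the reverberated signal.
    """
    degraded = add_noise_at_snr(reverberate(clean, rir), noise, snr_db)
    return degraded, clean

# Synthetic stand-ins for real recordings (hypothetical data, 16 kHz).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # decaying toy RIR

degraded, target = make_training_pair(clean, noise, rir, snr_db=10.0)
```

In a real setup the sine wave, the random noise and the toy impulse response would be replaced by recorded speech, noise recordings and measured or simulated room impulse responses, and the pairs would typically be converted to spectral features before training.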
