Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization

In this paper, we propose a robust distant-talking speech recognition method that combines a cepstral-domain denoising autoencoder (DAE) with a temporal structure normalization (TSN) filter. Because a DAE has a deep structure and nonlinear processing steps, it is flexible enough to model a highly nonlinear mapping between the input and output spaces. We train a DAE to map reverberant and noisy speech features to the underlying clean speech features in the cepstral domain. After the DAE is applied to suppress reverberation, a TSN filter is applied as post-processing to reduce the remaining noise and reverberation effects by normalizing the modulation spectra of the features to reference spectra of clean speech. The proposed method was evaluated using speech in simulated and real reverberant environments. By combining the cepstral-domain DAE and TSN, the average word error rate (WER) was reduced from 25.2% for the baseline system to 21.2% in simulated environments and from 47.5% to 41.3% in real environments.
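The idea behind the TSN post-processing step, normalizing the modulation spectrum of each feature trajectory to a clean-speech reference, can be illustrated with a minimal sketch. This is not the authors' filter design: it is a simplified frequency-domain variant that rescales the magnitude of each trajectory's spectrum directly, and the function name `normalize_modulation_spectrum`, the `eps` parameter, and the fixed-length reference `ref_psd` are assumptions introduced here for illustration only.

```python
import numpy as np

def normalize_modulation_spectrum(features, ref_psd, eps=1e-10):
    """Rescale the modulation spectrum of each feature dimension.

    features: (T, D) array, e.g. DAE-enhanced cepstral features over T frames.
    ref_psd:  (T // 2 + 1, D) reference power spectrum of clean-speech
              trajectories (hypothetical input, assumed precomputed).
    Returns a (T, D) array whose per-dimension power spectrum along the
    time axis approximately matches ref_psd, with the phase left unchanged.
    """
    n_frames = features.shape[0]
    # Spectrum of each feature trajectory along the time (frame) axis.
    spec = np.fft.rfft(features, axis=0)
    psd = np.abs(spec) ** 2
    # Real, non-negative gain that maps the observed PSD onto the reference.
    gain = np.sqrt(ref_psd / (psd + eps))
    # Apply the gain and return to the feature domain.
    return np.fft.irfft(spec * gain, n=n_frames, axis=0)


# Toy usage with synthetic data standing in for real features.
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 3))
reverberant = rng.standard_normal((100, 3)) * 2.0 + 0.5
ref_psd = np.abs(np.fft.rfft(clean, axis=0)) ** 2
normalized = normalize_modulation_spectrum(reverberant, ref_psd)
```

In practice the reference would be averaged over many clean training utterances and the gain would typically be smoothed or realized as a short filter, so that utterances of different lengths can be processed; those steps are omitted from this sketch.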
