THE NTU-ADSC SYSTEMS FOR REVERBERATION CHALLENGE 2014

This paper describes the speech enhancement and recognition systems we developed for the Reverberation Challenge 2014. To enhance noisy and reverberant speech for human listening, besides conventional methods such as delay-and-sum beamforming and late-reverberation reduction by spectral subtraction, we also study a novel learning-based speech enhancement approach. Specifically, we train deep neural networks (DNNs) to map reverberant spectrograms to the corresponding clean spectrograms, using parallel data of clean and reverberant speech. Results show that the trained DNN reduces reverberation significantly on unseen test data. For the speech recognition task, when parallel data is available, we train a DNN to map reverberant features to clean features, in the same spirit as the DNN-based speech enhancement. Results show that this DNN-based feature compensation improves recognition performance even when a DNN acoustic model is already used, demonstrating the benefit of explicitly cleansing the features. When parallel data is not available, as in the clean-condition training scheme, we instead focus on reducing the training-test mismatch with our proposed cross-transform feature adaptation, which exploits both temporal and spectral information. The cross transform is complementary to traditional model adaptation.
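The core of the learning-based enhancement is a regression DNN trained on parallel data: reverberant spectrogram frames as input, time-aligned clean frames as target. The sketch below illustrates that mapping with a single hidden layer trained by plain gradient descent on synthetic data; the frame counts, layer sizes, and the fabricated input/target matrices are all assumptions for illustration, and the actual system would use real parallel speech and a deeper network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parallel corpus: each row is one log-magnitude spectrogram
# frame. In the paper, X would come from reverberant speech and Y from the
# time-aligned clean recordings; here both are synthetic stand-ins.
n_frames, n_bins, n_hidden = 512, 40, 64
X = rng.standard_normal((n_frames, n_bins))
Y = np.tanh(0.1 * (X @ rng.standard_normal((n_bins, n_bins))))

# One hidden tanh layer with a linear output; weights are randomly
# initialized (no pre-training, unlike typical deep systems of the time).
W1 = 0.1 * rng.standard_normal((n_bins, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_bins)); b2 = np.zeros(n_bins)

def forward(X):
    H = np.tanh(X @ W1 + b1)       # hidden activations
    return H, H @ W2 + b2          # predicted "clean" frames

mse0 = float(np.mean((forward(X)[1] - Y) ** 2))  # error before training

# Minimize the squared error (summed over bins, averaged over frames)
# between predicted and clean frames by full-batch gradient descent.
lr = 0.05
for _ in range(300):
    H, P = forward(X)
    G = 2.0 * (P - Y) / n_frames          # gradient of loss w.r.t. P
    GH = (G @ W2.T) * (1.0 - H ** 2)      # backprop through tanh
    W2 -= lr * (H.T @ G);  b2 -= lr * G.sum(axis=0)
    W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(axis=0)

mse = float(np.mean((forward(X)[1] - Y) ** 2))  # error after training
```

At test time, only the forward pass is applied to reverberant frames, and the predicted spectrogram is combined with the reverberant phase to resynthesize the enhanced waveform.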
