Distant-Talking Speech Recognition Based on Spectral Subtraction by Multi-Channel LMS Algorithm

SUMMARY We propose a blind dereverberation method based on spectral subtraction using a multi-channel least mean squares (MCLMS) algorithm for distant-talking speech recognition. In a distant-talking environment, the channel impulse response is longer than the short-term spectral analysis window. By treating the late reverberation as additive noise, we apply a noise reduction technique based on spectral subtraction to estimate the power spectrum of the clean speech from the power spectra of the distorted speech and the unknown impulse responses. To estimate the power spectra of the impulse responses, a variable step-size unconstrained MCLMS (VSS-UMCLMS) algorithm for identifying the impulse responses in the time domain is extended to the frequency domain. To reduce the effect of the channel estimation error, we normalize the early reverberation by cepstral mean normalization (CMN) instead of spectral subtraction with the estimated impulse response. Furthermore, we combine the proposed method with conventional delay-and-sum beamforming. We conducted recognition experiments on distorted speech simulated by convolving multi-channel impulse responses with clean speech. The proposed method achieved a relative error reduction rate of 22.4% over conventional CMN. Combined with beamforming, it achieved a relative error reduction rate of 24.5% over conventional CMN with beamforming, using only an isolated word (with a duration of about 0.6 s) to estimate the spectrum of the impulse response.
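The core processing steps described above can be illustrated with a minimal sketch. This is not the authors' implementation; it is a hedged illustration assuming an STFT representation of the distorted speech and an externally supplied estimate of the late-reverberation power spectrum (which the paper obtains via the VSS-UMCLMS channel estimate). The function names, the over-subtraction factor `alpha`, and the spectral floor are illustrative assumptions, as is the simple delay-and-sum combiner.

```python
import numpy as np

def spectral_subtraction_dereverb(stft_distorted, late_reverb_psd,
                                  alpha=1.0, floor=0.01):
    """Treat late reverberation as additive noise: subtract its estimated
    power spectrum from the distorted speech power spectrum, keep the
    distorted phase, and apply a spectral floor to avoid negative power.
    `late_reverb_psd` stands in for the power contributed by the late part
    of the (estimated) channel impulse response."""
    power = np.abs(stft_distorted) ** 2
    clean_power = power - alpha * late_reverb_psd
    clean_power = np.maximum(clean_power, floor * power)  # spectral floor
    phase = np.angle(stft_distorted)
    return np.sqrt(clean_power) * np.exp(1j * phase)

def cepstral_mean_normalization(cepstra):
    """CMN: remove the per-utterance cepstral mean, which compensates
    short (early-reverberation) convolutional distortion.
    `cepstra` has shape (frames, coefficients)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def delay_and_sum(signals, delays):
    """Conventional delay-and-sum beamforming on time-domain signals.
    `signals` has shape (channels, samples); `delays` are integer
    steering delays in samples, one per channel."""
    max_d = max(delays)
    out = np.zeros(signals.shape[1] + max_d)
    for sig, d in zip(signals, delays):
        out[d:d + sig.shape[0]] += sig
    return out / signals.shape[0]
```

In the paper's pipeline, spectral subtraction removes the late reverberation, CMN (rather than subtraction with the estimated impulse response) normalizes the early reverberation, and delay-and-sum beamforming combines the channels before recognition.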
