A noise-robust speech recognition approach incorporating normalized speech/non-speech likelihood into hypothesis scores

In noisy environments, speech recognition decoders often incorrectly produce speech hypotheses for non-speech periods, and non-speech hypotheses, such as silence or a short pause, for speech periods. Reducing such errors is crucial for improving the performance of speech recognition systems. This paper proposes an approach that uses normalized speech/non-speech likelihoods, calculated with adaptive speech and non-speech GMMs, to weight the scores of recognition hypotheses produced by the decoder. To achieve good decoding performance, the GMMs are adapted to variations in the acoustic characteristics of input utterances and environmental noise using one of two online unsupervised adaptation methods: switching Kalman filtering (SKF) or maximum a posteriori (MAP) estimation. Experimental results on real-world in-car speech, the Drivers' Japanese Speech Corpus in a Car Environment (DJSC), and the AURORA-2 database show that the proposed method significantly improves recognition accuracy over a conventional approach using front-end voice activity detection (VAD), and that the improvement holds under various noise and task conditions.
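The core scoring idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: per-frame log-likelihoods under a speech GMM and a non-speech GMM are normalized into a posterior-like value in [0, 1], and the log of the value matching the hypothesis label (speech vs. non-speech) is added, with a weight `alpha`, to the decoder's acoustic score. The function names, the diagonal-covariance GMM form, and the weighting scheme here are illustrative assumptions.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood per frame under a diagonal-covariance GMM.
    # x: (T, D) frames; weights: (M,); means, variances: (M, D).
    diff = x[:, None, :] - means[None, :, :]                      # (T, M, D)
    comp = (-0.5 * np.sum(diff ** 2 / variances
                          + np.log(2 * np.pi * variances), axis=-1)
            + np.log(weights))                                    # (T, M)
    # Log-sum-exp over mixture components for numerical stability.
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).squeeze(1)

def normalized_speech_likelihood(frames, speech_gmm, nonspeech_gmm):
    # Normalized per-frame speech likelihood p_s / (p_s + p_n),
    # computed in the log domain to avoid underflow.
    ll_s = gmm_loglik(frames, *speech_gmm)
    ll_n = gmm_loglik(frames, *nonspeech_gmm)
    return 1.0 / (1.0 + np.exp(ll_n - ll_s))

def weighted_hypothesis_score(acoustic_score, frames, is_speech_hyp,
                              speech_gmm, nonspeech_gmm, alpha=1.0):
    # Add an alpha-weighted log of the normalized likelihood that matches
    # the hypothesis label (speech or non-speech) over the frame span.
    p = normalized_speech_likelihood(frames, speech_gmm, nonspeech_gmm)
    match = p if is_speech_hyp else 1.0 - p
    return acoustic_score + alpha * np.sum(np.log(np.clip(match, 1e-10, 1.0)))
```

With GMMs adapted online (e.g. via SKF or MAP, as in the paper), frames that clearly belong to speech push up speech hypotheses and pull down silence/short-pause hypotheses, and vice versa.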
