Blind Model Selection for Automatic Speech Recognition in Reverberant Environments

This communication presents a new method for automatic speech recognition in reverberant environments. The approach selects the best acoustic model from a library of models, each trained on an artificially reverberated speech database corresponding to a different reverberant condition. Given a speech utterance recorded in a reverberant room, a Maximum Likelihood estimate of the fullband room reverberation time is computed using a statistical model for short-term log-energy sequences of anechoic speech. The estimated reverberation time is then used to select the best acoustic model, i.e., the model trained on the speech database whose reverberation time most closely matches the estimate, and this model is used to recognize the reverberated utterance. The proposed model selection approach is shown to significantly improve recognition accuracy for a connected digit task in both simulated and real reverberant environments, outperforming standard channel normalization techniques.
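
The selection step described above can be illustrated with a minimal sketch. It assumes that the reverberation time has already been estimated and that the library is indexed by the reverberation time of each training database. The function name `select_acoustic_model`, the model identifiers, and the reverberation times in the library are hypothetical and do not come from the paper; the Maximum Likelihood estimator itself is not reproduced here.

```python
from typing import Dict


def select_acoustic_model(estimated_rt: float, library: Dict[float, str]) -> str:
    """Return the model whose training reverberation time is closest to the estimate.

    estimated_rt: estimated fullband reverberation time, in seconds.
    library: maps the reverberation time of each training database (seconds)
             to a model identifier (hypothetical naming).
    """
    if not library:
        raise ValueError("model library is empty")
    # Nearest-neighbour match on reverberation time.
    nearest_rt = min(library, key=lambda rt: abs(rt - estimated_rt))
    return library[nearest_rt]


# Hypothetical library: training reverberation time (s) -> model identifier.
library = {
    0.0: "hmm_anechoic",
    0.2: "hmm_rt_0.2s",
    0.4: "hmm_rt_0.4s",
    0.8: "hmm_rt_0.8s",
}

# An estimate of 0.47 s lies closest to the 0.4 s training condition.
chosen = select_acoustic_model(0.47, library)
```

The chosen model would then be passed to the recognizer in place of a single model trained on anechoic speech. A finer grid of reverberation times reduces the worst-case mismatch at the cost of training and storing more models.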
