Towards Robust and Adaptive Speech Recognition Models

In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches that deviate somewhat from standard ASR but have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multi-style” training. More specifically, we motivate and briefly describe new approaches based on multi-stream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based framework by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension of multi-stream ASR, we finally introduce a new approach, referred to as HMM2, in which the HMM emission probabilities are estimated via state-specific, feature-based HMMs responsible for merging the stream information and modeling its possible correlation.
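To make the recombination step concrete, the following minimal sketch shows one way per-stream phone posteriors from independent experts might be merged with reliability weights, using either a sum or a product rule. This is an illustration only, not the specific recombination scheme described in the paper; the three-stream, three-class setup, the posterior values, and the weights are all assumed for the example.

```python
import numpy as np

# Hypothetical per-stream phone posteriors for a single frame:
# each row is one "expert" (e.g. one frequency subband), each column a phone class.
# In a real system these would come from stream-specific acoustic models.
stream_posteriors = np.array([
    [0.70, 0.20, 0.10],   # expert 1 (e.g. low-frequency subband)
    [0.10, 0.60, 0.30],   # expert 2 (e.g. mid-frequency subband)
    [0.25, 0.25, 0.50],   # expert 3 (e.g. high-frequency subband)
])

# Illustrative reliability weights, e.g. reflecting per-stream noise estimates.
weights = np.array([0.5, 0.3, 0.2])

# Linear (sum-rule) combination of the stream posteriors.
combined_sum = weights @ stream_posteriors

# Log-linear (product-rule) combination, renormalized over classes.
log_combined = weights @ np.log(stream_posteriors)
combined_prod = np.exp(log_combined)
combined_prod /= combined_prod.sum()

print("sum rule:    ", combined_sum)
print("product rule:", combined_prod)
```

The intuition behind such weighted combinations, whatever their exact form, is that a stream corrupted by band-limited or nonstationary noise can be down-weighted without retraining the remaining experts, which is what makes the multi-stream view attractive for noise robustness.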
