Multi-stream adaptive evidence combination for noise robust ASR

In this paper we develop several mathematical models within the multi-stream paradigm for noise-robust ASR and discuss their close relationship with human speech perception. Largely inspired by Fletcher's "product-of-errors" rule in psychoacoustics, multi-band ASR aims for robustness to data mismatch by exploiting spectral redundancy while making minimal assumptions about noise type. Previous ASR tests have shown that independent sub-band processing can reduce recognition performance on clean speech. We overcome this problem by treating every combination of data sub-bands as an independent data stream. After introducing the background to multi-band ASR, we show how this "full combination" approach can be formalised, in the context of HMM/ANN-based ASR, by introducing a latent variable that specifies which data sub-bands in each data frame are free from data mismatch. This enables us to decompose the posterior probability for each phoneme into a reliability-weighted integral over all possible positions of clean data. This approach offers great potential for adaptation to rapidly changing and unpredictable noise.
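The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a discrete set of sub-bands, so the reliability-weighted integral becomes a sum over all band subsets, and it assumes each band is clean independently with a known probability. The function names, the `band_reliability` values, and the per-combination posterior vectors (standing in for the outputs of one ANN expert per band subset) are all hypothetical.

```python
from itertools import combinations


def all_combinations(num_bands):
    """Return every subset of band indices, including the empty set."""
    bands = range(num_bands)
    return [frozenset(c)
            for k in range(num_bands + 1)
            for c in combinations(bands, k)]


def combination_weights(band_reliability):
    """Weight each subset by the probability that exactly those bands are clean.

    Assumption (illustrative only): bands are clean independently, with
    band_reliability[i] the probability that band i is free from mismatch.
    """
    weights = {}
    for combo in all_combinations(len(band_reliability)):
        w = 1.0
        for i, r in enumerate(band_reliability):
            w *= r if i in combo else (1.0 - r)
        weights[combo] = w
    return weights


def full_combination_posterior(expert_posteriors, weights):
    """Combine per-subset phoneme posteriors into one posterior vector.

    expert_posteriors maps each band subset to a list of phoneme posteriors,
    here a stand-in for the output of an expert trained on that subset.
    The empty subset would typically carry the phoneme priors.
    """
    num_classes = len(next(iter(expert_posteriors.values())))
    combined = [0.0] * num_classes
    for combo, w in weights.items():
        for k, p in enumerate(expert_posteriors[combo]):
            combined[k] += w * p
    return combined


# Usage with two bands and two phoneme classes (made-up numbers).
experts = {
    frozenset(): [0.5, 0.5],        # no clean band: fall back to priors
    frozenset({0}): [0.8, 0.2],
    frozenset({1}): [0.3, 0.7],
    frozenset({0, 1}): [0.9, 0.1],
}
posterior = full_combination_posterior(experts, combination_weights([0.9, 0.5]))
```

Because the weights sum to one and each expert output is a distribution, the combined vector is also a distribution. With B bands the sum has 2^B terms, which is why this kind of scheme is practical only for a small number of sub-bands.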
