Paper information - Multi-stream adaptive evidence combination for noise robust ASR

In this paper we develop several mathematical models within the multi-stream paradigm for noise-robust ASR and discuss their close relationship with human speech perception. Largely inspired by Fletcher's "product-of-errors" rule in psychoacoustics, multi-band ASR aims for robustness to data mismatch by exploiting spectral redundancy while making minimal assumptions about noise type. Previous ASR tests have shown that independent sub-band processing can degrade recognition performance on clean speech. We have overcome this problem by treating every combination of data sub-bands as an independent data stream. After introducing the background to multi-band ASR, we show how this "full combination" approach can be formalised, in the context of HMM/ANN-based ASR, by introducing a latent variable that specifies which data sub-bands in each data frame are free from data mismatch. This enables us to decompose the posterior probability for each phoneme into a reliability-weighted integral over all possible positions of clean data. This approach offers great potential for adaptation to rapidly changing and unpredictable noise.
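The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two-band setup, and all posterior and weight values are hypothetical, and a discrete sum over sub-band subsets stands in for the integral mentioned above. Treating the empty subset as an expert that returns class priors is also an assumption made here for illustration.

```python
from itertools import combinations


def product_of_errors(band_errors):
    """Product-of-errors rule: full-band error is the product of per-band errors."""
    total = 1.0
    for err in band_errors:
        total *= err
    return total


def all_band_subsets(n_bands):
    """Enumerate every subset of band indices, including the empty set."""
    return [
        frozenset(subset)
        for size in range(n_bands + 1)
        for subset in combinations(range(n_bands), size)
    ]


def full_combination_posterior(subset_posteriors, subset_weights):
    """Reliability-weighted sum of per-subset phoneme posteriors.

    subset_posteriors: {subset: {phoneme: P(phoneme | data in subset)}}
    subset_weights:    {subset: P(subset is the clean one)}, summing to 1
    """
    phonemes = next(iter(subset_posteriors.values())).keys()
    combined = {q: 0.0 for q in phonemes}
    for subset, weight in subset_weights.items():
        for q in combined:
            combined[q] += weight * subset_posteriors[subset][q]
    return combined


# Hypothetical two-band example with two phoneme classes "a" and "b".
subsets = all_band_subsets(2)  # {}, {0}, {1}, {0, 1}
posteriors = {
    frozenset(): {"a": 0.5, "b": 0.5},       # assumed: priors when no band is clean
    frozenset({0}): {"a": 0.7, "b": 0.3},
    frozenset({1}): {"a": 0.4, "b": 0.6},
    frozenset({0, 1}): {"a": 0.8, "b": 0.2},
}
weights = {
    frozenset(): 0.1,
    frozenset({0}): 0.6,   # band 0 alone is judged most likely to be clean
    frozenset({1}): 0.1,
    frozenset({0, 1}): 0.2,
}
combined = full_combination_posterior(posteriors, weights)
```

Because the weights sum to one and each per-subset posterior is normalised, the combined values again form a probability distribution over phonemes. The number of subsets grows as 2 to the power of the number of bands, so a sketch like this is only practical for a small number of sub-bands.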

Hervé Glotin | Hervé Bourlard | Andrew C. Morris | Astrid Hagen
