Toward optimizing stream fusion in multistream recognition of speech.

A multistream phoneme recognition framework is proposed based on forming streams from different spectrotemporal modulations of speech. Phoneme posterior probabilities were estimated from each stream separately and combined at the output level. A statistical model of the final estimated posterior probabilities is used to characterize the system performance. During the operation, the best fusion architecture is chosen automatically to maximize the similarity of output statistics to clean condition. Results on phoneme recognition from noisy speech indicate the effectiveness of the proposed method.

[1]  James R. Glass,et al.  Heterogeneous acoustic measurements for phonetic classification 1 , 1997, EUROSPEECH.

[2]  Nelson Morgan,et al.  Multi-stream to many-stream: using spectro-temporal features for ASR , 2009, INTERSPEECH.

[3]  Harvey b. Fletcher,et al.  Speech and hearing in communication , 1953 .

[4]  James R. Glass,et al.  HETEROGENEOUS ACOUSTIC MEASUREMENTS FOR PHONETIC CLASSIFICATION , 1997 .

[5]  Hynek Hermansky,et al.  A multistream multiresolution framework for phoneme recognition , 2010, INTERSPEECH.

[6]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[7]  Nima Mesgarani,et al.  Phoneme representation and classification in primary auditory cortex. , 2008, The Journal of the Acoustical Society of America.

[8]  Misha Pavel,et al.  Towards ASR on partially corrupted speech , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[9]  J. Rodgers,et al.  Thirteen ways to look at the correlation coefficient , 1988 .

[10]  Powen Ru,et al.  Multiresolution spectrotemporal analysis of complex sounds. , 2005, The Journal of the Acoustical Society of America.

[11]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[12]  Fabio Valente,et al.  Combination of Acoustic Classifiers Based on Dempster-Shafer Theory of Evidence , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[13]  Nima Mesgarani,et al.  Denoising in the Domain of Spectrotemporal Modulations , 2007, EURASIP J. Audio Speech Music. Process..

[14]  Alexandros Potamianos,et al.  Multi-band speech recognition in noisy environments , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).