Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments

Robust speech recognition under varying acoustic conditions may be achieved by exploiting multiple sources of information in the speech signal. Alongside a standard acoustic representation, we use an articulatory representation consisting of pseudo-articulatory features as a second information source. Hybrid ANN/HMM recognizers using either of these representations are evaluated on a continuous numbers recognition task (OGI Numbers95) under clean, reverberant, and noisy conditions. An error analysis of preliminary recognition results shows that the two representations produce qualitatively different errors, which suggests combining them. We investigate various combination schemes at the phoneme estimation level and show that significant improvements can be achieved under all three acoustic conditions.
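To make the combination step concrete, the following is a minimal sketch, not the paper's exact implementation, of merging two streams of frame-level phoneme posteriors (one from the acoustic ANN, one from the articulatory ANN) with the weighted sum and product rules before HMM decoding. The function name combine_posteriors, the interpolation weight, and the toy data are our own assumptions for illustration.

import numpy as np

def combine_posteriors(acoustic, articulatory, rule="product", weight=0.5):
    """Combine two (frames x phonemes) posterior matrices.

    acoustic, articulatory: np.ndarray of shape (T, K), rows summing to 1.
    rule: "sum" (weighted average) or "product" (log-domain average).
    weight: contribution of the acoustic stream, in [0, 1].
    (Hypothetical interface; the paper's actual scheme may differ.)
    """
    if rule == "sum":
        combined = weight * acoustic + (1.0 - weight) * articulatory
    elif rule == "product":
        # Geometric interpolation: a weighted sum of log posteriors.
        eps = 1e-12  # guard against log(0)
        log_comb = (weight * np.log(acoustic + eps)
                    + (1.0 - weight) * np.log(articulatory + eps))
        combined = np.exp(log_comb)
    else:
        raise ValueError(f"unknown rule: {rule}")
    # Renormalize each frame so the posteriors again sum to 1.
    return combined / combined.sum(axis=1, keepdims=True)

# Toy example: 3 frames, 4 phoneme classes.
rng = np.random.default_rng(0)
p_acoustic = rng.dirichlet(np.ones(4), size=3)
p_articulatory = rng.dirichlet(np.ones(4), size=3)
print(combine_posteriors(p_acoustic, p_articulatory, rule="sum"))

The product rule tends to help when the two streams make independent errors, which matches the motivation above: since the acoustic and articulatory recognizers err on qualitatively different frames, one stream's confident, correct posterior can veto the other's mistake.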
