Robust speech recognition using articulatory gestures in a Dynamic Bayesian Network framework

Articulatory Phonology models speech as a spatio-temporal constellation of constriction events (e.g., raising the tongue tip or narrowing the lips), known as articulatory gestures. These gestures are associated with distinct organs along the vocal tract: the lips, tongue tip, tongue body, velum, and glottis. In this paper we present a Dynamic Bayesian Network (DBN)-based speech recognition architecture that models articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture, we performed (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray Microbeam database. Our results indicate that the use of gestural information improves recognition performance compared to a system using acoustic information alone.
