Articulatory trajectories for large-vocabulary speech recognition

Studies have demonstrated that articulatory information can model speech variability effectively and can potentially improve speech recognition performance. However, most studies involving articulatory information have focused on estimating it from speech, and few have actually used such features for recognition. Speech recognition studies using articulatory information have been largely confined to digit or medium-vocabulary tasks, and efforts to incorporate it into large-vocabulary systems have been limited. We present a neural network model that estimates articulatory trajectories from speech signals; the model was trained on synthetic speech generated by Haskins Laboratories' task-dynamic model of speech production. The trained model was applied to natural speech, and the estimated articulatory trajectories were used in conjunction with standard cepstral features to train acoustic models for large-vocabulary recognition systems. Two large-vocabulary English datasets were used in the experiments reported here. Results indicate that employing articulatory information improves recognition performance not only under clean conditions but also under noisy background conditions. Perceptually motivated robust features were also explored in this study, and the best performance was obtained when systems based on articulatory, standard cepstral, and perceptually motivated features were all combined.
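The fusion described above can be sketched as a simple per-frame concatenation of the two feature streams. This is a minimal illustration, not the paper's implementation: the function and variable names (`fuse_features`, `mfcc`, `tv`) and the feature dimensions are assumptions chosen for clarity, and the actual system would first time-align the streams and apply the usual normalization before acoustic model training.

```python
# Hypothetical sketch: fuse standard cepstral features with estimated
# articulatory trajectories into tandem feature vectors, one per frame.
# Dimensions below (3 cepstra, 2 tract variables) are illustrative only.

def fuse_features(cepstral_frames, articulatory_frames):
    """Concatenate per-frame cepstral and articulatory feature vectors.

    Both inputs are lists of equal length; each element is a list of
    floats for one analysis frame (e.g. MFCCs and estimated
    tract-variable trajectories sampled at the same frame rate).
    """
    if len(cepstral_frames) != len(articulatory_frames):
        raise ValueError("frame counts must match after time alignment")
    # Append the articulatory vector to the cepstral vector, frame by frame.
    return [c + a for c, a in zip(cepstral_frames, articulatory_frames)]

# Toy example: two frames of 3 cepstral coefficients + 2 tract variables
mfcc = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
tv = [[0.1, 0.2], [0.15, 0.25]]
fused = fuse_features(mfcc, tv)
# each fused frame now has 3 + 2 = 5 dimensions
```

In practice the fused vectors would feed a standard acoustic-model front end, which is why frame-level alignment between the two streams is checked before concatenation.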
