Recognizing articulatory gestures from speech for robust speech recognition.

Studies have shown that supplementary articulatory information can improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, so it must be estimated from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract variable time functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise-robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.
