Articulatory features from deep neural networks and their role in speech recognition

This paper presents a deep neural network (DNN) that extracts articulatory information from the speech signal and explores different ways of using that information in a continuous speech recognition task. The DNN was trained to estimate articulatory trajectories from input speech, using as training data a corpus of synthetic English words generated by the Haskins Laboratories' task-dynamic model of speech production. The DNN was trained on speech parameterized as cepstral features, and we compared several cepstral features to assess their effect on the accuracy of articulatory trajectory estimation. The best-performing feature was used to train the final DNN, which then predicted articulatory trajectories for the training and test sets of Aurora-4, the noisy Wall Street Journal (WSJ0) corpus. This study also explored the hidden variables in the DNN pipeline as a candidate acoustic feature for speech recognition, with encouraging results. Word recognition results on Aurora-4 indicate that the articulatory features from the DNN improve speech recognition performance when fused with standard cepstral features; used on their own, however, they fail to match the baseline performance.
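
As an illustration only, the pipeline described above can be sketched in a few lines of NumPy. The abstract does not give the network architecture or the fusion method, so everything here is an assumption: the layer sizes, the tanh activations, the random (untrained) weights, the helper name `mlp_forward`, and the choice of frame-level concatenation as the "fusion" step. The sketch only shows how estimated trajectories and hidden-layer activations could each be turned into per-frame feature vectors.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a tanh MLP with a linear output layer.

    Returns the output and a list of hidden-layer activations.
    (Hypothetical helper; not the architecture used in the paper.)
    """
    h = x
    hidden = []
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
        hidden.append(h)
    out = h @ weights[-1] + biases[-1]
    return out, hidden

rng = np.random.default_rng(0)

# Assumed dimensions: 100 frames, 13 cepstral coefficients,
# two hidden layers of 32 units, 8 articulatory trajectories.
n_frames, n_cep, n_hidden, n_artic = 100, 13, 32, 8
sizes = [n_cep, n_hidden, n_hidden, n_artic]

# Random weights stand in for a trained network.
weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

# Stand-in cepstral features, one row per frame.
cepstra = rng.standard_normal((n_frames, n_cep))

# Estimated articulatory trajectories and hidden activations per frame.
trajectories, hidden = mlp_forward(cepstra, weights, biases)

# Fusion assumed to be frame-wise concatenation of both feature streams.
fused = np.concatenate([cepstra, trajectories], axis=1)

# Hidden-variable features: activations of the last hidden layer.
hidden_feats = hidden[-1]
```

In this sketch `fused` has one row per frame with the cepstral and articulatory dimensions side by side, and `hidden_feats` is an alternative feature stream taken from inside the network; either array could then be passed to a recognizer's acoustic front end.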
