Noise robustness of tract variables and their application to speech recognition

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates the role of these variables in noise-robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. We first trained the model on clean synthetic speech and then tested its noise robustness on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [4]), which generated the synthetic speech along with the corresponding TVs. Eight vocal tract constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], and tongue body [TBCL]). We also explored using a modified phase opponency (MPO) [11] speech enhancement technique as a preprocessor for TV estimation to observe its effect on noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally, the TV estimation module was tested on naturally produced speech contaminated with noise at different signal-to-noise ratios. The TVs estimated from the natural speech corpus were then used in conjunction with the baseline features in automatic speech recognition (ASR) experiments. Results on the Aurora-2 dataset show average relative improvements over the baseline of 22% and 21% in ASR performance for car and subway noise, respectively, when the TVs are estimated from MPO-enhanced speech.
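The paper states that Kalman smoothing is applied to the frame-wise TV estimates to reduce estimation noise, but does not specify the state-space model or noise parameters. The sketch below is a minimal, hypothetical illustration of that step: a Rauch-Tung-Striebel (forward-backward Kalman) smoother with an assumed constant-velocity articulator model; the function name `kalman_smooth` and the variances `q` and `r` are illustrative choices, not values from the paper.

```python
import numpy as np

def kalman_smooth(z, q=1e-3, r=1e-2):
    """RTS-smooth one noisy 1-D trajectory (e.g. a single TV track).

    z : frame-wise noisy observations (the ANN's TV estimates)
    q : assumed process-noise variance (how fast the articulator may move)
    r : assumed observation-noise variance (the ANN's estimation noise)
    Returns the smoothed position trajectory.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])               # only position is observed
    Q = q * np.eye(2)
    R = np.array([[r]])

    n = len(z)
    x = np.zeros((n, 2)); P = np.zeros((n, 2, 2))      # filtered estimates
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))    # one-step predictions

    x_est, P_est = np.array([z[0], 0.0]), np.eye(2)
    for k in range(n):
        # predict
        x_pred = F @ x_est
        P_pred = F @ P_est @ F.T + Q
        xp[k], Pp[k] = x_pred, P_pred
        # update with the k-th observation
        y = z[k] - H @ x_pred
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_est = x_pred + (K @ y).ravel()
        P_est = (np.eye(2) - K @ H) @ P_pred
        x[k], P[k] = x_est, P_est

    # backward (RTS) smoothing pass
    xs = x.copy()
    for k in range(n - 2, -1, -1):
        G = P[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = x[k] + G @ (xs[k + 1] - xp[k + 1])
    return xs[:, 0]
```

On a slowly varying trajectory corrupted with white noise, the smoothed output should have a lower mean-squared error than the raw observations, which is the effect the paper relies on before feeding the TVs to the recognizer.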

[1] Raymond D. Kent et al., "X-ray microbeam speech production database," 1990.

[2] Gernot A. Fink et al., "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, 2002.

[3] David Pearce et al., "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," INTERSPEECH, 2000.

[4] Dani Byrd et al., "TADA: An enhanced, portable Task Dynamics model in MATLAB," 2004.

[5] Sorin Dusan, "Methods for Integrating Phonetic and Phonological Knowledge in Speech Inversion," 2001.

[6] Jeffrey M. Zacks et al., "A new neural network for articulatory speech recognition and its application to vowel identification," Computer Speech & Language, 1994.

[7] C. Espy-Wilson et al., "A step in the realization of a speech recognition system based on gestural phonology and landmarks," 2009.

[8] Richard S. McGowan et al., "Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests," Speech Communication, 1994.

[9] Abeer Alwan et al., "A noise-type and level-dependent MPO-based speech enhancement architecture with variable frame analysis for noise-robust speech recognition," INTERSPEECH, 2009.

[10] J. Ryalls et al., "Introduction to Speech Science: From Basic Theories to Clinical Applications," 2003.

[11] Carol Y. Espy-Wilson et al., "Speech enhancement using modified phase opponency model," INTERSPEECH, 2007.

[12] Katrin Kirchhoff et al., "Robust speech recognition using articulatory information," 1998.

[13] G. Papcun et al., "Inferring articulation and recognizing gestures from acoustics with a neural network trained on X-ray microbeam data," The Journal of the Acoustical Society of America, 1992.

[14] C. Browman et al., "Articulatory Phonology: An Overview," Phonetica, 1992.

[15] Korin Richmond, "Estimating articulatory parameters from the acoustic speech signal," 2002.

[16] Louis Goldstein et al., "Articulatory gestures as phonological units," Phonology, 1989.

[17] Li Deng et al., "Statistical estimation of articulatory trajectories from the speech signal using dynamical and phonological constraints," 2000.