Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition

Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. However, learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-uniqueness. This work explores the use of deep neural networks (DNNs) and convolutional neural networks (CNNs) for mapping speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate continuous speech recognition tasks: WSJ1 and Aurora-4. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank energy features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features; the results indicate that the HCNN-based model achieves lower word error rates than the CNN/DNN baseline systems.
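
As a concrete illustration of the speech-inversion setup, the following is a minimal PyTorch sketch of a CNN trained as a regressor from a context window of acoustic features to articulatory trajectories (e.g., tract variables). The feature dimensions, kernel shapes, and layer sizes here are illustrative assumptions; the abstract does not specify the actual network configurations.

```python
import torch
import torch.nn as nn

# Speech inversion as regression: map an 11-frame window of 40-channel
# filterbank features to 8 articulatory trajectories for the center frame.
# All dimensions below are assumptions for illustration only.
inversion_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(5, 8)),  # time-frequency convolution
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 7 * 33, 512),           # 7 x 33 map after valid convolution
    nn.ReLU(),
    nn.Linear(512, 8),                     # 8 articulatory outputs
)
loss_fn = nn.MSELoss()

fbank = torch.randn(16, 1, 11, 40)   # batch of acoustic context windows
target = torch.randn(16, 8)          # articulatory targets (placeholder data)
loss = loss_fn(inversion_cnn(fbank), target)
```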

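The HCNN fusion described above can be sketched in the same way. Again, the kernel shapes, the 11-frame context, the 8 articulatory channels, and the 2000 CD states are illustrative assumptions, and summing the per-stream CD-state scores is only one plausible reading of fusion "at the output CD state level"; the abstract does not give the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HCNN(nn.Module):
    """Sketch of a hybrid CNN: a time-frequency convolutional stream over
    filterbank energies and a time-only convolutional stream over
    articulatory features, fused at the CD-state output."""

    def __init__(self, n_filters=40, n_artic=8, context=11, n_cd_states=2000):
        super().__init__()
        # Flattened feature sizes after the valid (unpadded) convolutions below.
        acou_dim = 64 * (context - 4) * ((n_filters - 7) // 3)
        artic_dim = 64 * (context - 4)
        # Acoustic stream: convolve across both time (context frames)
        # and frequency (filterbank channels), then score CD states.
        self.acoustic = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(5, 8)),  # (time, frequency) kernel
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),      # pool along frequency only
            nn.Flatten(),
            nn.Linear(acou_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_cd_states),
        )
        # Articulatory stream: convolve across time only; each articulatory
        # trajectory (e.g., a tract variable) is one input channel.
        self.articulatory = nn.Sequential(
            nn.Conv1d(n_artic, 64, kernel_size=5),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(artic_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_cd_states),
        )

    def forward(self, fbank, artic):
        # fbank: (batch, 1, context, n_filters); artic: (batch, n_artic, context)
        # Decision-level fusion: combine the two streams' CD-state scores.
        return self.acoustic(fbank) + self.articulatory(artic)

model = HCNN()
logits = model(torch.randn(4, 1, 11, 40), torch.randn(4, 8, 11))
print(logits.shape)  # torch.Size([4, 2000])
```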