Robust speech recognition combining cepstral and articulatory features

In this paper, a nonlinear relationship between pronunciation and auditory perception is introduced into speech recognition, and the results show improved robustness. An Extreme Learning Machine (ELM) that maps this relationship was trained on the MOCHA-TIMIT database. Articulatory Features (AFs) produced by the network were fused with MFCCs to train the acoustic models, DNN-HMM and GMM-HMM, in this experiment. The MFCCs-AFs-GMM-HMM system yields a 117.0% relative increment of WER, compared with 125.6% for MFCCs-GMM-HMM, and the DNN-HMM model outperforms the GMM-HMM model in both relative and absolute terms.
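
As a rough illustration of the feature pipeline described above, the sketch below trains an ELM regressor that maps MFCC frames to articulatory trajectories and then concatenates the predicted AFs with the MFCCs to form the fused features passed to the GMM-HMM or DNN-HMM back end. The feature dimensions, function names, hidden-layer size, and the ridge-regularized solve are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def train_elm(X, T, hidden_units=512, reg=1e-3, rng=None):
    """Train a single-hidden-layer ELM regressor.
    X: (N, d_in) acoustic frames (e.g. MFCCs), T: (N, d_out) articulatory targets
    (e.g. EMA coil trajectories from MOCHA-TIMIT). The input weights are random
    and fixed; only the output weights beta are solved for in closed form.
    """
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((X.shape[1], hidden_units))  # random input weights (never trained)
    b = rng.standard_normal(hidden_units)                # random hidden biases
    H = np.tanh(X @ W + b)                               # hidden-layer activations
    # Ridge-regularized least-squares solution for the output weights
    beta = np.linalg.solve(H.T @ H + reg * np.eye(hidden_units), H.T @ T)
    return W, b, beta

def predict_afs(X, W, b, beta):
    """Map acoustic frames to articulatory features with the trained ELM."""
    return np.tanh(X @ W + b) @ beta

if __name__ == "__main__":
    # Stand-in random data in place of parallel MOCHA-TIMIT acoustic/articulatory frames
    mfcc_train = np.random.randn(1000, 39)
    ema_train = np.random.randn(1000, 14)
    W, b, beta = train_elm(mfcc_train, ema_train)

    mfcc_test = np.random.randn(200, 39)
    afs = predict_afs(mfcc_test, W, b, beta)
    fused = np.hstack([mfcc_test, afs])  # MFCCs-AFs vectors for acoustic-model training
    print(fused.shape)                   # (200, 53)
```

The closed-form solve is what makes the ELM attractive here: the acoustic-to-articulatory mapping is learned without iterative backpropagation, so the AF extractor can be retrained quickly when the front-end features change.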
