Cross-corpus and cross-linguistic evaluation of a speaker-dependent DNN-HMM ASR system using EMA data

We test a hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) phone recognition system that uses measured articulatory features as additional observations, evaluating it on two English corpora and one Italian corpus. All three corpora contain simultaneous recordings of speech acoustics and Electromagnetic Articulography (EMA) data. We show that articulatory features reconstructed from speech acoustics through an acoustic-to-articulatory mapping consistently reduce the phone error rate, with the exception of a single case in which the reconstruction accuracy of the articulatory features is significantly lower than in all other cases. Error analysis shows that in all corpora the articulatory features improve the discrimination of almost all phonemes, although some phonemic categories benefit clearly more than others.
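A minimal sketch of the general scheme the abstract describes: a regression network recovers articulatory (EMA) trajectories from acoustic frames, and the hybrid DNN-HMM acoustic model then takes the acoustic features concatenated with the reconstructed articulatory features as its observation vector. All dimensions, layer sizes, and the simple feedforward architecture below are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch only: feature dimensions, layer sizes, and the
# concatenation scheme are assumptions, not the paper's reported setup.
import torch
import torch.nn as nn

N_ACOUSTIC = 39      # acoustic features per frame (e.g. MFCCs + deltas); assumed
N_EMA = 14           # articulatory channels (EMA coil coordinates); assumed
N_HMM_STATES = 144   # HMM states (e.g. 48 phones x 3 states each); assumed

# Acoustic-to-articulatory mapping: regress EMA trajectories from acoustics.
aam = nn.Sequential(
    nn.Linear(N_ACOUSTIC, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_EMA),
)

# Hybrid DNN-HMM acoustic model: emits HMM state posteriors from the
# concatenation of acoustic and reconstructed articulatory features.
dnn_am = nn.Sequential(
    nn.Linear(N_ACOUSTIC + N_EMA, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_HMM_STATES),
)

def state_log_posteriors(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, N_ACOUSTIC) -> (T, N_HMM_STATES) log-posteriors."""
    ema_hat = aam(frames)                       # reconstructed articulatory features
    obs = torch.cat([frames, ema_hat], dim=1)   # augmented observation vector
    return torch.log_softmax(dnn_am(obs), dim=1)

if __name__ == "__main__":
    dummy_utterance = torch.randn(100, N_ACOUSTIC)   # 100 dummy frames
    print(state_log_posteriors(dummy_utterance).shape)  # torch.Size([100, 144])
```

In a full system, the log-posteriors would be divided by the state priors to obtain scaled likelihoods before Viterbi decoding against the HMM, and both networks would be trained on the corpora containing parallel acoustic and EMA recordings.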
