An Analysis of HMM-based prediction of articulatory movements

This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMMs). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train the HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference in performance between quinphone models and fully context-dependent models is no longer significant. The best overall performance is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, giving an average RMS error of 0.900 mm and an average correlation coefficient of 0.855. Finally, we also apply the same quinphone HMMs to the acoustic-to-articulatory, or inversion, mapping problem, where only acoustic input is available; here, an average RMS error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
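As a concrete illustration of the trajectory-generation step described above, the following sketch implements a minimal MLPG solve for a single articulatory channel, assuming per-frame Gaussian means and variances for static and delta features have already been selected from an HMM state sequence. The function names, the delta window, and the toy inputs are illustrative assumptions, not the paper's implementation; acoustic integration, state-duration modeling, and delta-delta features are omitted. A small helper also computes the RMS error and correlation coefficient used as evaluation measures.

import numpy as np

def mlpg_single_dim(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """Minimal maximum-likelihood parameter generation (MLPG) sketch for one
    articulatory dimension (e.g. one EMA coil coordinate).

    means, variances : (T, 2) arrays of per-frame Gaussian means/variances
                       for the [static, delta] observation, taken from the
                       selected HMM state sequence (hypothetical inputs).
    delta_win        : window coefficients defining the delta feature.
    Returns the (T,) static trajectory c maximising the output likelihood,
    i.e. the solution of (W' D W) c = W' D mu, with D the diagonal precision.
    """
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                         # static row copies c[t]
        for offset, coeff in zip((-1, 0, 1), delta_win):
            tau = min(max(t + offset, 0), T - 1)  # clip at utterance edges
            W[2 * t + 1, tau] += coeff            # delta row: weighted differences
    mu = means.reshape(-1)                        # stacked [static, delta] means
    prec = 1.0 / variances.reshape(-1)            # diagonal precision entries
    A = W.T @ (prec[:, None] * W)                 # W' D W
    b = W.T @ (prec * mu)                         # W' D mu
    return np.linalg.solve(A, b)

def rms_and_corr(predicted, measured):
    """RMS error (in the trajectories' units, e.g. mm) and correlation coefficient."""
    rms = np.sqrt(np.mean((predicted - measured) ** 2))
    corr = np.corrcoef(predicted, measured)[0, 1]
    return rms, corr

if __name__ == "__main__":
    # Toy example with random statistics; real inputs would come from the trained HMMs.
    rng = np.random.default_rng(0)
    T = 200
    means = rng.normal(size=(T, 2))
    variances = rng.uniform(0.1, 1.0, size=(T, 2))
    trajectory = mlpg_single_dim(means, variances)
    print(rms_and_corr(trajectory, means[:, 0]))

A full implementation would typically add delta-delta features, handle all articulatory dimensions jointly, and exploit the band structure of W' D W for efficiency; the sketch keeps only the core weighted least-squares form of the generation algorithm.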
