Acoustic-to-articulatory inversion using speech recognition and trajectory formation based on phoneme hidden Markov models

In order to recover the movements of usually hidden articulators such as the tongue or the velum, we have developed a data-driven speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by electromagnetic articulography (EMA), and MFCCs plus energy computed from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the phoneme chain together with the phoneme durations; this information is then used by a trajectory formation procedure, based on the articulatory part of the HMMs, to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2 mm.

Index Terms: speech inversion, augmented speech, automatic speech recognition, HTK, Electro-Magnetic Articulography (EMA), hidden Markov model (HMM), trajectory formation, HTS.
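The trajectory formation step rests on HMM-based parameter generation: given the per-frame Gaussian means and variances of the articulatory stream (static and delta features) along the state sequence fixed by recognition, the smoothest trajectory consistent with both is obtained by weighted least squares. The sketch below is a minimal illustration under simplifying assumptions (single-Gaussian states, one articulatory dimension, central-difference deltas); the function name and interface are illustrative, not the authors' implementation.

```python
import numpy as np

def generate_trajectory(mean_static, mean_delta, var_static, var_delta):
    """Illustrative HMM parameter generation for one articulatory channel.

    Finds the trajectory c that maximises the likelihood of the per-frame
    static and delta Gaussians, with delta(t) = (c[t+1] - c[t-1]) / 2
    (clipped at the utterance edges). Solves (W^T P W) c = W^T P mu,
    where P is the diagonal precision matrix built from the variances.
    """
    T = len(mean_static)
    W = np.zeros((2 * T, T))        # stacked static + delta constraint rows
    mu = np.zeros(2 * T)            # target means
    prec = np.zeros(2 * T)          # diagonal precisions (1 / variance)
    for t in range(T):
        # static constraint: c[t] ~ N(mean_static[t], var_static[t])
        W[t, t] = 1.0
        mu[t] = mean_static[t]
        prec[t] = 1.0 / var_static[t]
        # delta constraint: (c[t+1] - c[t-1]) / 2 ~ N(mean_delta[t], var_delta[t])
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[T + t, hi] += 0.5
        W[T + t, lo] -= 0.5
        mu[T + t] = mean_delta[t]
        prec[T + t] = 1.0 / var_delta[t]
    # weighted least squares solution
    WtP = W.T * prec
    return np.linalg.solve(WtP @ W, WtP @ mu)
```

The delta constraints are what couple neighbouring frames: without them the solution would simply copy the piecewise-constant state means, whereas with them the generated articulatory trajectory is smooth across state boundaries.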
