Toward a Multi-Speaker Visual Articulatory Feedback System

In this paper, we present recent developments of the HMM-based acoustic-to-articulatory inversion approach that we are developing for a “visual articulatory feedback” system. In this approach, multi-stream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps: phonetic and state decoding is performed first, and articulatory trajectories are then inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation (MLPG) algorithm. We introduce a new procedure for re-estimating the HMM parameters, based on the Minimum Generation Error (MGE) criterion. We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.

Index Terms: acoustic-to-articulatory inversion, electromagnetic articulography (EMA), hidden Markov model (HMM), minimum generation error (MGE), speaker adaptation, maximum likelihood linear regression (MLLR).
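The second inversion step described above can be illustrated with a minimal sketch of MLPG for a single articulatory dimension. Given the per-frame (static, delta) Gaussian means and variances read off the decoded state sequence, MLPG solves a weighted least-squares problem for the static trajectory c that maximizes the likelihood of the stacked static-and-delta observation Wc. This is not the authors' implementation; the function name, the simple three-tap delta window, and the single-dimension restriction are assumptions for illustration.

```python
import numpy as np

def mlpg(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """Toy MLPG for one articulatory dimension.

    means, variances: arrays of shape (T, 2) holding the per-frame
    (static, delta) Gaussian parameters taken from the decoded HMM
    state sequence.  Returns the static trajectory c maximizing the
    likelihood of the stacked observation [c; D c].
    """
    T = len(means)
    # Window matrix: identity rows for statics, banded rows for deltas.
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        for offset, w in zip((-1, 0, 1), delta_win):
            if 0 <= t + offset < T:
                D[t, t + offset] = w
    W = np.vstack([I, D])                              # (2T, T)
    mu = np.concatenate([means[:, 0], means[:, 1]])    # stacked means
    prec = 1.0 / np.concatenate([variances[:, 0], variances[:, 1]])
    # Normal equations of the weighted least-squares problem:
    # (W' P W) c = W' P mu, with P the diagonal precision matrix.
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

The delta constraint is what smooths the output: frames with confident delta statistics pull neighboring static values toward consistent slopes, so the generated trajectory does not simply step between state means.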
