Acoustic-to-articulatory inversion mapping with Gaussian mixture model

This paper describes acoustic-to-articulatory inversion mapping using a Gaussian Mixture Model (GMM). The correspondence between an acoustic parameter and an articulatory parameter is modeled by a GMM trained on parallel acoustic-articulatory data. We measure the performance of the GMM-based mapping and investigate the effectiveness of using multiple acoustic frames as an input feature and of using multiple mixture components. The results show that although increasing the number of mixture components reduces the estimation error, it introduces many discontinuities in the estimated articulatory trajectories. To address this problem, we apply maximum likelihood estimation (MLE) with articulatory dynamic features to the GMM-based mapping. Experimental results demonstrate that MLE with dynamic features estimates more appropriate articulatory movements than the GMM-based mapping smoothed by a lowpass filter.
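
The frame-wise GMM-based mapping can be viewed as minimum mean-square-error regression under a GMM fitted on joint acoustic-articulatory frames: each articulatory estimate is a posterior-weighted sum of per-mixture conditional means. Below is a minimal sketch of that mapping under stated assumptions (time-aligned 2-D feature arrays, scikit-learn/scipy for the GMM and Gaussian densities); function names and dimensions are illustrative, not the authors' implementation, and the MLE step with dynamic features is not included.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def train_joint_gmm(acoustic, articulatory, n_components=32):
    """Fit a full-covariance GMM on joint [acoustic; articulatory] frames.

    acoustic:     (T, dim_x) array of acoustic features per frame
    articulatory: (T, dim_y) array of time-aligned articulatory features
    """
    joint = np.hstack([acoustic, articulatory])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", max_iter=200).fit(joint)


def gmm_inversion(gmm, acoustic, dim_x):
    """Frame-by-frame MMSE mapping: y_hat(t) = E[y | x(t)] under the joint GMM."""
    X = np.asarray(acoustic)
    mu_x = gmm.means_[:, :dim_x]
    mu_y = gmm.means_[:, dim_x:]
    S_xx = gmm.covariances_[:, :dim_x, :dim_x]
    S_yx = gmm.covariances_[:, dim_x:, :dim_x]
    M = gmm.n_components

    # Mixture posteriors P(m | x) from the marginal GMM over the acoustic part.
    like = np.stack([gmm.weights_[m] *
                     multivariate_normal.pdf(X, mean=mu_x[m], cov=S_xx[m])
                     for m in range(M)], axis=1)              # (T, M)
    post = like / like.sum(axis=1, keepdims=True)

    # Posterior-weighted conditional means:
    # E[y | x, m] = mu_y[m] + S_yx[m] S_xx[m]^{-1} (x - mu_x[m])
    y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
    for m in range(M):
        A = S_yx[m] @ np.linalg.inv(S_xx[m])                  # regression matrix
        y_hat += post[:, [m]] * (mu_y[m] + (X - mu_x[m]) @ A.T)
    return y_hat
```

In practice the input vector x(t) may concatenate several neighboring acoustic frames, and the trajectory-level MLE with dynamic (delta) features described above replaces this independent frame-by-frame estimate, which is what removes the discontinuities introduced by using many mixture components.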
