Acoustic-to-Articulatory Inversion Mapping Based on Latent Trajectory Gaussian Mixture Model

Maximum likelihood parameter trajectory estimation based on a Gaussian mixture model (GMM) has been successfully applied to acoustic-to-articulatory inversion mapping. In the conventional method, the GMM parameters are optimized by maximizing a likelihood function for the joint static and dynamic features of acoustic-articulatory data. The articulatory parameter trajectories are then estimated for the given acoustic data by maximizing a likelihood function for the static features only, under a constraint between static and dynamic features that accounts for inter-frame correlation. Because the training and mapping criteria are inconsistent, the trained GMM is not optimal for the mapping process. This inconsistency has been addressed within a trajectory training framework, but that framework makes some parameters, e.g., covariance matrices and mixture component sequences, more difficult to optimize. In this paper, we propose an inversion mapping method based on a latent trajectory GMM (LT-GMM) as another way to overcome the inconsistency. The proposed method makes it possible to optimize the LT-GMM parameters with a well-formulated algorithm, such as the EM algorithm, which is not feasible in traditional trajectory training. Experimental results demonstrate that the proposed method yields higher inversion mapping accuracy than the conventional GMM-based method.
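To make the static/dynamic constraint in the conventional mapping step concrete, the following is a minimal sketch, not the authors' implementation and not the proposed LT-GMM. It assumes a single one-dimensional articulatory parameter, an already fixed mixture component sequence (so each frame has one Gaussian), diagonal covariances, and a delta window of (-0.5, 0, 0.5); all of these, along with the function names, are illustrative assumptions. Under them, the static trajectory y that maximizes the likelihood subject to the constraint that the joint features equal W y has the standard closed form y = (Wᵀ D⁻¹ W)⁻¹ Wᵀ D⁻¹ m, where m and D hold the per-frame means and variances of the static and delta features.

```python
import numpy as np


def build_window_matrix(num_frames):
    """Return W such that W @ y interleaves static and delta features.

    Row 2t is the static feature y[t]; row 2t+1 is the delta feature
    0.5 * (y[t+1] - y[t-1]), truncated at the sequence boundaries.
    The delta window is an assumption made for this sketch.
    """
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0                  # static row
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5     # delta row, previous frame
        if t < num_frames - 1:
            W[2 * t + 1, t + 1] = 0.5      # delta row, next frame
    return W


def ml_trajectory(mean, var):
    """Maximum likelihood static trajectory under the delta constraint.

    mean, var: arrays of shape (2T,) holding interleaved static/delta
    means and (diagonal) variances for each of the T frames.
    """
    num_frames = mean.shape[0] // 2
    W = build_window_matrix(num_frames)
    WtDinv = W.T * (1.0 / var)             # W^T D^{-1} for diagonal D
    # Solve (W^T D^{-1} W) y = W^T D^{-1} m for y.
    return np.linalg.solve(WtDinv @ W, WtDinv @ mean)


if __name__ == "__main__":
    # Noisy static means, zero delta means with small variance:
    # the estimated trajectory is pulled toward a smooth curve.
    T = 6
    mean = np.zeros(2 * T)
    mean[0::2] = [0.0, 1.2, 0.8, 1.1, 0.9, 1.0]
    var = np.ones(2 * T)
    var[1::2] = 0.05
    print(ml_trajectory(mean, var))
```

The sketch shows why the mismatch described above arises: the Gaussians are trained on joint static and dynamic features, whereas the mapping step optimizes only the static sequence y through W.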
