On Improving Dynamic State Space Approaches to Articulatory Inversion With MAP-Based Parameter Estimation

This paper presents a complete framework for articulatory inversion based on jump Markov linear systems (JMLS). In this model, the acoustic measurements are the observations, the position of each articulator forms the continuous-valued hidden state, and a discrete-valued hidden modal state represents the discrete regimes of the system. Articulatory inversion based on JMLS involves two steps: learning the model parameter set of the system, and inferring the state of the system (the position of each articulator) from acoustic measurements. Iterative learning algorithms based on maximum-likelihood (ML) and maximum a posteriori (MAP) criteria are proposed to learn the model parameter set of the JMLS. It is shown that, when both acoustic and articulatory data are given, the learning procedure of the JMLS is a generalized version of hidden Markov model (HMM) training. It is also shown that the MAP-based learning algorithm improves the modeling performance of the system and gives significantly better results than ML. The inference stage of the proposed algorithm is based on an interacting multiple models (IMM) approach and can be done online (filtering), offline (smoothing), or both. Formulas are provided for IMM-based JMLS smoothing, and smoothing is shown to significantly improve the performance of articulatory inversion compared to filtering. Several experiments are conducted with the MOCHA database to show the performance of the proposed method. Comparison with methods given in the literature shows that the proposed method improves the performance of state space approaches, making them comparable to the best published results.
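To make the model structure concrete, the following is a minimal sketch of a JMLS and of the difference between ML and MAP estimation when the modal state, the articulatory state, and the acoustic observations are all available. It is an illustration under stated assumptions, not the paper's algorithm: the symbols A, C, Q, R, Pi, the function names, the toy dimensions, and the zero-mean Gaussian prior (which turns the MAP estimate into a ridge-style estimate with weight lam) are all hypothetical choices, since the abstract does not state the priors or the notation.

```python
import numpy as np

def simulate_jmls(A, C, Q, R, Pi, T, rng):
    """Sample modes r_t, hidden states x_t, and observations y_t.

    Assumed form (notation is hypothetical):
        r_t ~ Markov chain with transition matrix Pi
        x_t = A[r_t] x_{t-1} + w_t,  w_t ~ N(0, Q[r_t])
        y_t = C[r_t] x_t     + v_t,  v_t ~ N(0, R[r_t])
    """
    M, n, m = len(A), A[0].shape[0], C[0].shape[0]
    r = np.zeros(T, dtype=int)
    x = np.zeros((T, n))
    y = np.zeros((T, m))
    r[0] = rng.integers(M)
    x[0] = rng.standard_normal(n)
    for t in range(T):
        if t > 0:
            r[t] = rng.choice(M, p=Pi[r[t - 1]])
            w = rng.multivariate_normal(np.zeros(n), Q[r[t]])
            x[t] = A[r[t]] @ x[t - 1] + w
        v = rng.multivariate_normal(np.zeros(m), R[r[t]])
        y[t] = C[r[t]] @ x[t] + v
    return r, x, y

def fit_observation_matrix(x, y, lam=0.0):
    """Estimate C in y = C x + v from paired rows of x and y.

    lam = 0 is the ML (least-squares) estimate. lam > 0 is a MAP
    estimate under an assumed zero-mean Gaussian prior on C, which
    is algebraically a ridge regression.
    """
    n = x.shape[1]
    gram = x.T @ x + lam * np.eye(n)
    return np.linalg.solve(gram, x.T @ y).T

# Toy system: 2 modes, 2-D state (articulators), 3-D observation (acoustics).
rng = np.random.default_rng(0)
A = [0.9 * np.eye(2), np.array([[0.7, 0.2], [-0.2, 0.7]])]
C = [rng.standard_normal((3, 2)) for _ in range(2)]
Q = [0.1 * np.eye(2)] * 2
R = [0.01 * np.eye(3)] * 2
Pi = np.array([[0.95, 0.05], [0.05, 0.95]])

r, x, y = simulate_jmls(A, C, Q, R, Pi, T=2000, rng=rng)

# With the modes known, each mode's parameters are fit on its own frames.
C_ml = [fit_observation_matrix(x[r == k], y[r == k]) for k in range(2)]
C_map = [fit_observation_matrix(x[r == k], y[r == k], lam=5.0) for k in range(2)]
```

In this fully observed setting the per-mode fit is an ordinary regression on the frames assigned to that mode. The prior only shrinks the estimate toward zero, which matters most for modes with few training frames. Learning with hidden modes, and the IMM-based filtering and smoothing described above, are not covered by this sketch.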
