Extending the Cascaded Gaussian Mixture Regression Framework for Cross-Speaker Acoustic-Articulatory Mapping

This paper addresses the adaptation of an acoustic-articulatory inversion model, trained on a reference speaker, to the voice of another source speaker using a limited amount of audio-only data. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM), and articulatory data are inferred from acoustic data via the associated Gaussian mixture regression (GMR). To address speaker adaptation, we previously proposed a general framework called Cascaded-GMR (C-GMR), which decomposes the adaptation process into two consecutive steps: spectral conversion between the source and reference speakers, followed by acoustic-articulatory inversion of the converted spectral trajectories. In particular, we proposed the integrated C-GMR technique (IC-GMR), in which both steps are tied together within the same probabilistic model. In this paper, we extend the C-GMR framework with another model called Joint-GMR (J-GMR). In contrast to the IC-GMR, this model aims at exploiting all potential acoustic-articulatory relationships, including those between the source speaker's acoustics and the reference speaker's articulation. We present the full derivation of the exact expectation-maximization (EM) training algorithm for the J-GMR, which exploits the missing-data methodology of machine learning to deal with the limited amount of adaptation data. We provide an extensive evaluation of the J-GMR on both synthetic acoustic-articulatory data and the multispeaker MOCHA EMA database. We compare the performance of the J-GMR with that of other models in the C-GMR framework, notably the IC-GMR, and discuss their respective merits.
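As background, the inference step underlying all models in the C-GMR framework is standard Gaussian mixture regression: given a joint GMM over stacked acoustic and articulatory vectors, the articulatory estimate is the posterior-weighted sum of the per-component conditional means. The following is a minimal NumPy sketch of that generic GMR step only, not of the paper's C-GMR, IC-GMR, or J-GMR adaptation models; all function and variable names are illustrative.

```python
import numpy as np

def gmr_predict(x, weights, means, covs, dx):
    """Generic Gaussian mixture regression: E[y | x] under a joint GMM on [x; y].

    weights: (K,) mixture weights
    means:   (K, dx+dy) joint means
    covs:    (K, dx+dy, dx+dy) joint covariances
    dx:      dimension of the input (acoustic) part x
    """
    K = len(weights)
    log_h = np.empty(K)       # log of unnormalized responsibilities h_k(x)
    cond_means = []           # per-component conditional means E[y | x, k]
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        Sxx = covs[k, :dx, :dx]       # input-input covariance block
        Syx = covs[k, dx:, :dx]       # output-input cross-covariance block
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)
        log_det = np.linalg.slogdet(Sxx)[1]
        # h_k(x) is proportional to w_k * N(x; mu_x_k, Sxx_k)
        log_h[k] = np.log(weights[k]) - 0.5 * (diff @ sol + log_det
                                               + dx * np.log(2 * np.pi))
        # conditional mean of the k-th Gaussian: mu_y + Syx Sxx^{-1} (x - mu_x)
        cond_means.append(mu_y + Syx @ sol)
    h = np.exp(log_h - log_h.max())   # normalize responsibilities stably
    h /= h.sum()
    return h @ np.vstack(cond_means)  # posterior-weighted regression output

# Usage: a single-component joint GMM encoding y = 2x exactly.
weights = np.array([1.0])
means = np.array([[0.0, 0.0]])
covs = np.array([[[1.0, 2.0], [2.0, 5.0]]])
y_hat = gmr_predict(np.array([1.0]), weights, means, covs, dx=1)  # → [2.0]
```

In the cascaded setting, one such regression maps source spectra to reference spectra and a second maps converted spectra to articulatory trajectories; the paper's contribution lies in how the underlying joint GMMs are trained from incomplete adaptation data.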
