Extending the Cascaded Gaussian Mixture Regression Framework for Cross-Speaker Acoustic-Articulatory Mapping

This paper addresses the adaptation of an acoustic-articulatory inversion model, trained on a reference speaker, to the voice of another source speaker using a limited amount of audio-only data. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM), and articulatory data are inferred from acoustic data via the associated Gaussian mixture regression (GMR). To address speaker adaptation, we previously proposed a general framework called Cascaded-GMR (C-GMR), which decomposes the adaptation process into two consecutive steps: spectral conversion between the source and reference speakers, followed by acoustic-articulatory inversion of the converted spectral trajectories. In particular, we proposed the integrated C-GMR technique (IC-GMR), in which both steps are tied together within the same probabilistic model. In this paper, we extend the C-GMR framework with another model called Joint-GMR (J-GMR). In contrast to the IC-GMR, this model aims at exploiting all potential acoustic-articulatory relationships, including those between the source speaker's acoustics and the reference speaker's articulation. We present the full derivation of the exact expectation-maximization (EM) training algorithm for the J-GMR, which exploits the missing-data methodology of machine learning to deal with the limited amount of adaptation data. We provide an extensive evaluation of the J-GMR on both synthetic acoustic-articulatory data and the multispeaker MOCHA EMA database. We compare the performance of the J-GMR with that of other models in the C-GMR framework, notably the IC-GMR, and discuss their respective merits.
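As background, the inference step underlying all models in the C-GMR framework is standard Gaussian mixture regression: given a joint GMM over stacked acoustic and articulatory vectors, the articulatory estimate is the posterior-weighted sum of the per-component conditional means. The following is a minimal NumPy sketch of that generic GMR step only, not of the paper's C-GMR, IC-GMR, or J-GMR adaptation models; all function and variable names are illustrative.

```python
import numpy as np

def gmr_predict(x, weights, means, covs, dx):
    """Generic Gaussian mixture regression: E[y | x] under a joint GMM on [x; y].

    weights: (K,) mixture weights
    means:   (K, dx+dy) joint means
    covs:    (K, dx+dy, dx+dy) joint covariances
    dx:      dimension of the input (acoustic) part x
    """
    K = len(weights)
    log_h = np.empty(K)       # log of unnormalized responsibilities h_k(x)
    cond_means = []           # per-component conditional means E[y | x, k]
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        Sxx = covs[k, :dx, :dx]       # input-input covariance block
        Syx = covs[k, dx:, :dx]       # output-input cross-covariance block
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)
        log_det = np.linalg.slogdet(Sxx)[1]
        # h_k(x) is proportional to w_k * N(x; mu_x_k, Sxx_k)
        log_h[k] = np.log(weights[k]) - 0.5 * (diff @ sol + log_det
                                               + dx * np.log(2 * np.pi))
        # conditional mean of the k-th Gaussian: mu_y + Syx Sxx^{-1} (x - mu_x)
        cond_means.append(mu_y + Syx @ sol)
    h = np.exp(log_h - log_h.max())   # normalize responsibilities stably
    h /= h.sum()
    return h @ np.vstack(cond_means)  # posterior-weighted regression output

# Usage: a single-component joint GMM encoding y = 2x exactly.
weights = np.array([1.0])
means = np.array([[0.0, 0.0]])
covs = np.array([[[1.0, 2.0], [2.0, 5.0]]])
y_hat = gmr_predict(np.array([1.0]), weights, means, covs, dx=1)  # → [2.0]
```

In the cascaded setting, one such regression maps source spectra to reference spectra and a second maps converted spectra to articulatory trajectories; the paper's contribution lies in how the underlying joint GMMs are trained from incomplete adaptation data.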
