Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements could significantly advance our understanding of speech production, our capacity to assess and treat pathologies in a clinical setting, and speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because the relationship between articulation and acoustics is complex and speaker-specific, existing inversion approaches do not generalize well across speakers. Because acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated: speaker weights are estimated from acoustic data and then used to derive speaker-adapted articulatory models. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that when the reference group is restricted to speakers with strong individual speaker-dependent inversion performance, PRSW attains kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that, given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
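The parallel weighting idea can be illustrated with a minimal sketch. It assumes each reference speaker is summarized by one acoustic vector and one parallel articulatory vector, and it estimates the weights by ordinary least squares, which is a stand-in for illustration rather than the estimation procedure used in the paper. All function names, array shapes, and the toy data are hypothetical.

```python
import numpy as np


def estimate_weights(ref_acoustic, target_acoustic):
    """Least-squares weights w such that ref_acoustic.T @ w approximates the target.

    ref_acoustic:    (n_speakers, d_acoustic) one acoustic vector per reference speaker
    target_acoustic: (d_acoustic,) vector computed from the target's adaptation data
    Assumption: plain least squares stands in for the paper's weight estimation.
    """
    w, *_ = np.linalg.lstsq(ref_acoustic.T, target_acoustic, rcond=None)
    return w


def adapt_articulatory(ref_articulatory, weights):
    """Apply the acoustically derived weights to the parallel articulatory vectors.

    ref_articulatory: (n_speakers, d_articulatory), same speaker order as ref_acoustic
    """
    return ref_articulatory.T @ weights


# Toy data (hypothetical): 4 reference speakers, 12-dim acoustic, 6-dim articulatory.
rng = np.random.default_rng(0)
ref_acoustic = rng.normal(size=(4, 12))
ref_articulatory = rng.normal(size=(4, 6))

# Synthetic target built as a known mixture of the reference speakers.
true_w = np.array([0.5, 0.3, 0.2, 0.0])
target_acoustic = ref_acoustic.T @ true_w

# Weights come from acoustics only; no target kinematic data is used.
weights = estimate_weights(ref_acoustic, target_acoustic)
target_articulatory = adapt_articulatory(ref_articulatory, weights)
```

In this sketch the target speaker contributes only acoustic data. The articulatory estimate is carried over entirely from the reference speakers, which is why the choice of reference set matters for the outcome.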

[1]  William F. Katz,et al.  Augmented visual feedback in second language learning: Training Japanese post‐alveolar flaps to American English speakers , 2007 .

[2]  S. Renals,et al.  Acoustic-Articulatory Modelling with the Trajectory HMM , 2007 .

[3]  An Ji,et al.  The Electromagnetic Articulography Mandarin Accented English (EMA-MAE) corpus of acoustic and 3D articulatory kinematic data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Korin Richmond,et al.  Comparison of HMM and TMDN methods for lip synchronisation , 2010, INTERSPEECH.

[5]  Korin Richmond,et al.  Estimating articulatory parameters from the acoustic speech signal , 2002 .

[6]  Timothy J. Hazen A comparison of novel techniques for rapid speaker adaptation , 2000, Speech Commun..

[7]  Li Deng,et al.  An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition. , 2002, The Journal of the Acoustical Society of America.

[8]  Masaaki Honda,et al.  Estimation of articulatory movements from speech acoustics using an HMM-based speech production model , 2004, IEEE Transactions on Speech and Audio Processing.

[9]  Tao Chen,et al.  Speaker selection training for large vocabulary continuous speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Jun Wang,et al.  Opti-speech: a real-time, 3d visual feedback system for speech training , 2014, INTERSPEECH.

[11]  Takayuki Ito,et al.  An EMA-based articulatory feedback approach to facilitate L2 speech production learning , 2013 .

[12]  Petros Faloutsos,et al.  Acquisition of the 3D surface of the palate by in-vivo digitization with Wave , 2012, Speech Commun..

[13]  Ren-Hua Wang,et al.  Integrating Articulatory Features Into HMM-Based Parametric Speech Synthesis , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Olov Engwall Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher , 2012 .

[15]  Li Deng,et al.  Acoustic-To-Articulatory Inversion Using Dynamical and Phonological Constraints , 2017 .

[16]  Le Zhang,et al.  Acoustic-Articulatory Modeling With the Trajectory HMM , 2008, IEEE Signal Processing Letters.

[17]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[18]  T Kaburagi,et al.  An ultrasonic method for monitoring tongue shape and the position of a fixed point on the tongue surface. , 1994, The Journal of the Acoustical Society of America.

[19]  V. Gracco,et al.  Accurate recovery of articulator positions from acoustics: new conclusions based on human data. , 1996, The Journal of the Acoustical Society of America.

[20]  Elliot Saltzman,et al.  Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies , 2010, IEEE Journal of Selected Topics in Signal Processing.

[21]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[22]  Gérard Bailly,et al.  Speaker adaptation of an acoustic-articulatory inversion model using cascaded Gaussian mixture regressions , 2013, INTERSPEECH.

[23]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[24]  Keiichi Tokuda,et al.  Acoustic-to-articulatory inversion mapping with Gaussian mixture model , 2004, INTERSPEECH.

[25]  Shrikanth S. Narayanan,et al.  A subject-independent acoustic-to-articulatory inversion , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Masaaki Honda,et al.  Speaker Adaptation Method for Acoustic-to-Articulatory Inversion using an HMM-Based Speech Production Model , 2004, IEICE Trans. Inf. Syst..