Articulatory Controllable Speech Modification Based on Statistical Inversion and Production Mappings

In this paper, we present an innovative way of utilizing the natural relationship between speech sounds and articulatory movements by developing an articulatory controllable speech modification system. Specifically, we employ statistical acoustic-to-articulatory inversion mapping and articulatory-to-acoustic production mapping based on a Gaussian mixture model, which allows flexible modification of the model parameters and does not depend on text input features. An input speech signal can be modified by manipulating the unobserved articulatory movements through a sequence of inversion and production mappings. To ensure the naturalness of articulatory movement trajectories, we introduce a method for manipulating articulatory parameters that considers their intercorrelation. Moreover, to generate high-quality modified speech sounds, we avoid vocoder-based excitation generation by presenting several implementations of direct waveform modification, which filters the input speech signal directly using the differences in spectral parameters. The experimental results demonstrate that: 1) sequential inversion and production mappings achieve higher accuracy in the estimation of spectral parameters than conventional production mapping using measured articulatory parameters; 2) manipulating articulatory parameters while considering their intercorrelation makes it possible to generate more natural trajectories of modified articulatory movements; 3) the implementations of the direct waveform modification method significantly improve the quality of modified speech sounds, even under varying speaking conditions; and 4) the system is controllable, as shown by its capability of producing modified vowel sounds through the manipulation of appropriate articulatory configurations.
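
The sequence of inversion and production mappings described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single joint Gaussian mixture model over stacked [acoustic; articulatory] frames, uses a frame-wise conditional-mean mapping rather than any trajectory-level estimation, and runs on random toy parameters. The names `gmm_map` and `swap_blocks`, the feature dimensions, and the fixed offset applied to one articulatory dimension are all hypothetical.

```python
import numpy as np

def gmm_map(x, weights, means, covs, dx):
    """Frame-wise conditional mean E[y | x] under a joint GMM over [x; y].

    x: (T, dx) source frames; means: (M, D); covs: (M, D, D); dx: source dim.
    """
    x = np.atleast_2d(x)
    T, M = x.shape[0], len(weights)
    dy = means.shape[1] - dx
    log_post = np.empty((T, M))
    cond_means = np.empty((M, T, dy))
    for m in range(M):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        S_xx = covs[m, :dx, :dx]
        S_yx = covs[m, dx:, :dx]
        diff = x - mu_x
        sol = np.linalg.solve(S_xx, diff.T).T           # S_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(S_xx)
        # Log of weight times the marginal Gaussian density of x
        log_post[:, m] = np.log(weights[m]) - 0.5 * (
            logdet + np.sum(diff * sol, axis=1) + dx * np.log(2 * np.pi))
        cond_means[m] = mu_y + sol @ S_yx.T             # per-component regression
    log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)             # P(m | x)
    return np.einsum("tm,mtd->td", post, cond_means)

def swap_blocks(means, covs, dx):
    """Reorder a joint GMM over [x; y] into one over [y; x] (assumed shortcut)."""
    idx = np.r_[dx:means.shape[1], 0:dx]
    return means[:, idx], covs[:, idx][:, :, idx]

# Toy parameters (random, for illustration only; not trained on real data)
rng = np.random.default_rng(0)
M, d_ac, d_art = 2, 3, 2
D = d_ac + d_art
weights = np.array([0.5, 0.5])
means = rng.normal(size=(M, D))
B = rng.normal(size=(M, D, D))
covs = B @ B.transpose(0, 2, 1) + 0.1 * np.eye(D)       # positive definite

acoustic = rng.normal(size=(5, d_ac))                    # input spectral frames
art = gmm_map(acoustic, weights, means, covs, d_ac)      # inversion mapping
art_mod = art.copy()
art_mod[:, 0] += 0.5                                     # hypothetical manipulation
p_means, p_covs = swap_blocks(means, covs, d_ac)
acoustic_mod = gmm_map(art_mod, weights, p_means, p_covs, d_art)  # production mapping
```

Note that the fixed offset on one dimension ignores the intercorrelation between articulatory parameters that the abstract treats as essential for natural trajectories, so this sketch only shows the data flow from inversion, through manipulation, to production.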
