Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion.

Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker acoustic space such that speaker specific details are minimized. The speaker normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs as compared to the system where speaker normalization was not performed. To determine the efficacy of the method across datasets, cross speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results prove that the VTLN approach provides improvement in performance even across datasets.

[1]  Kazuyo Tanaka,et al.  Toward an acoustic-articulatory model of inter-speaker variability , 2000, INTERSPEECH.

[2]  Miguel Á. Carreira-Perpiñán,et al.  An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping , 2007, INTERSPEECH.

[3]  Ricardo Gutierrez-Osuna,et al.  Accent conversion through cross-speaker articulatory synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Yves Laprie,et al.  A variational approach for estimating vocal tract shapes from the speech signal , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[5]  M. Cavin,et al.  The use of ultrasound biofeedback for improving English /r/ , 2015 .

[6]  C. Browman,et al.  Articulatory Phonology: An Overview , 1992, Phonetica.

[7]  Keiichi Tokuda,et al.  Acoustic-to-articulatory inversion mapping with Gaussian mixture model , 2004, INTERSPEECH.

[8]  Larry P. Heck,et al.  MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research , 2013 .

[9]  Gernot A. Fink,et al.  Combining acoustic and articulatory feature information for robust speech recognition , 2002, Speech Commun..

[10]  Shrikanth Narayanan,et al.  An approach to real-time magnetic resonance imaging for speech production. , 2003, The Journal of the Acoustical Society of America.

[11]  Thomas F. Quatieri,et al.  Classification of depression state based on articulatory precision , 2013, INTERSPEECH.

[12]  J. Hogden,et al.  Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding , 1996 .

[13]  Korin Richmond,et al.  A trajectory mixture density network for the acoustic-articulatory inversion mapping , 2006, INTERSPEECH.

[14]  Dimitra Vergyri,et al.  The SRI AVEC-2014 Evaluation System , 2014, AVEC '14.

[15]  P. Schönle,et al.  Electromagnetic articulography: Use of alternating magnetic fields for tracking movements of multiple points inside and outside the vocal tract , 1987, Brain and Language.

[16]  Sacha Krstulovic Speech analysis with production constraints , 2001 .

[17]  Raymond D. Kent,et al.  X‐ray microbeam speech production database , 1990 .

[18]  Elliot Saltzman,et al.  Articulatory features from deep neural networks and their role in speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Shrikanth Narayanan,et al.  A generalized smoothness criterion for acoustic-to-articulatory inversion. , 2010, The Journal of the Acoustical Society of America.

[20]  Yves Laprie,et al.  Modeling the articulatory space using a hypercube codebook for acoustic-to-articulatory inversion. , 2005, The Journal of the Acoustical Society of America.

[21]  IEEE Recommended Practice for Speech Quality Measurements , 1969, IEEE Transactions on Audio and Electroacoustics.

[22]  Kiyoshi Honda,et al.  Principal components of vocal-tract area functions and inversion of vowels by linear regression of cepstrum coefficients , 2007, J. Phonetics.

[23]  Yoshua Bengio,et al.  Scaling learning algorithms towards AI , 2007 .

[24]  Louis Goldstein,et al.  Recognizing articulatory gestures from speech for robust speech recognition. , 2012, The Journal of the Acoustical Society of America.

[25]  Simon King,et al.  Detection of phonological features in continuous speech using neural networks , 2000, Comput. Speech Lang..

[26]  Herbert Gish,et al.  A parametric approach to vocal tract length normalization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[27]  Katrin Kirchhoff,et al.  Robust speech recognition using articulatory information , 1998 .

[28]  Richard S. McGowan,et al.  Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests , 1994, Speech Commun..

[29]  Shrikanth S. Narayanan,et al.  A subject-independent acoustic-to-articulatory inversion , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Masaaki Honda,et al.  Estimation of articulatory movements from speech acoustics using an HMM-based speech production model , 2004, IEEE Transactions on Speech and Audio Processing.

[31]  Hosung Nam,et al.  Quantifying kinematic aspects of reduction in a contrasting rate production task , 2017 .

[32]  Laurent Girin,et al.  Speaker-Adaptive Acoustic-Articulatory Inversion Using Cascaded Gaussian Mixture Regression , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[33]  Prasanta Kumar Ghosh,et al.  Improved subject-independent acoustic-to-articulatory inversion , 2015, Speech Commun..

[34]  P. Mccabe,et al.  Ultrasound visual feedback treatment and practice variability for residual speech sound errors. , 2014, Journal of speech, language, and hearing research : JSLHR.

[35]  L Saltzman Elliot,et al.  A Dynamical Approach to Gestural Patterning in Speech Production , 1989 .

[36]  Elliot Saltzman,et al.  Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies , 2010, IEEE Journal of Selected Topics in Signal Processing.

[37]  Alex Waibel,et al.  Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition , 1997 .

[38]  Simon King,et al.  Modelling the uncertainty in recovering articulation from acoustics , 2003, Comput. Speech Lang..

[39]  B. Atal,et al.  Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique. , 1978, The Journal of the Acoustical Society of America.