Multi-Corpus Acoustic-to-Articulatory Speech Inversion

Several technologies are used to measure speech articulatory movements, including electromagnetic articulometry (EMA), ultrasound, real-time magnetic resonance imaging (MRI), and X-ray microbeam. Each of these techniques provides a different view of the vocal tract. Even measurements made with the same technique can differ greatly due to differences in sensor placement and speaker anatomy. This limits most articulatory studies to single datasets. However, to yield better results in downstream applications, speech inversion systems should generalize more broadly, which requires combining data from multiple sources. This paper proposes a multi-task learning based deep neural network architecture for acoustic-to-articulatory speech inversion, trained on three different articulatory datasets: two measured using EMA and one using X-ray microbeam. Experiments show improved accuracy of the proposed acoustic-to-articulatory mapping compared to systems trained on single datasets.
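The multi-task setup described above can be sketched as a shared acoustic encoder feeding corpus-specific output heads, one per articulatory dataset. The sketch below is a minimal illustration of that structure, not the paper's actual architecture; all layer widths, feature dimensions, and corpus names are assumptions made for the example.

```python
import numpy as np

# Hypothetical sketch: a shared encoder maps acoustic frames to a common
# representation, and each corpus (two EMA sets, one X-ray microbeam set)
# gets its own regression head onto its articulatory channels.
# All dimensions below are illustrative assumptions, not values from the paper.

rng = np.random.default_rng(0)

N_ACOUSTIC = 40   # acoustic features per frame (assumed)
N_HIDDEN = 128    # shared-layer width (assumed)
HEAD_DIMS = {"ema_a": 12, "ema_b": 12, "xrmb": 16}  # channels per corpus (assumed)

# Shared encoder parameters (trained on all corpora jointly)
W_shared = rng.normal(0.0, 0.1, (N_ACOUSTIC, N_HIDDEN))
b_shared = np.zeros(N_HIDDEN)

# One linear output head per corpus (trained only on that corpus's frames)
heads = {name: (rng.normal(0.0, 0.1, (N_HIDDEN, dim)), np.zeros(dim))
         for name, dim in HEAD_DIMS.items()}

def invert(frames, corpus):
    """Map acoustic frames (T, N_ACOUSTIC) to the corpus's articulatory space."""
    h = np.tanh(frames @ W_shared + b_shared)  # shared representation
    W_head, b_head = heads[corpus]
    return h @ W_head + b_head                 # corpus-specific trajectories

frames = rng.normal(size=(50, N_ACOUSTIC))     # 50 frames of dummy acoustics
out = invert(frames, "xrmb")
print(out.shape)  # (50, 16)
```

During training, each mini-batch would be routed through the head matching its source corpus, so the shared layers see all three datasets while each head only fits its own sensor geometry.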
