Transfer Learning of Articulatory Information Through Phone Information

Articulatory information has been argued to be useful for several speech tasks. However, in most practical scenarios this information is not readily available. We propose a novel transfer learning framework to obtain reliable articulatory information in such cases. We demonstrate its reliability both in terms of estimating parameters of speech production and in its ability to enhance the accuracy of an end-to-end phone recognizer. First, articulatory information is estimated from speaker-independent phonemic features using a small speech corpus with electromagnetic articulography (EMA) measurements. Next, we employ a teacher-student model to learn to estimate articulatory features from acoustic features for the target phone recognition task. Phone recognition experiments demonstrate that the proposed transfer learning approach outperforms the baseline transfer learning system derived directly from an acoustic-to-articulatory inversion (AAI) model. The articulatory features estimated by the proposed method, in conjunction with acoustic features, improved the phone error rate (PER) by 6.7% and 6% on the TIMIT core test and development sets, respectively, compared to standalone static acoustic features. Interestingly, this improvement is slightly higher than what is obtained with static+dynamic acoustic features, but with a significantly smaller number of input features. Adding articulatory features on top of static+dynamic acoustic features yields a small but positive PER improvement.
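
The two-stage pipeline described above can be illustrated with a short sketch. The following Python (PyTorch) code is a minimal, hypothetical rendering, not the paper's actual implementation: the network type (a BLSTM regressor), layer sizes, feature dimensions, and all names are illustrative assumptions. It shows a teacher trained to map phone features to EMA trajectories on the small articulatory corpus, and a student distilled to reproduce the teacher's articulatory estimates from acoustic features alone, so that articulatory information becomes available on corpora without EMA recordings.

# Minimal sketch of the teacher-student transfer learning setup.
# All dimensions, architectures, and names are illustrative assumptions.
import torch
import torch.nn as nn

PHONE_DIM, ACOUSTIC_DIM, ARTIC_DIM = 48, 13, 12  # assumed feature sizes

class BLSTMRegressor(nn.Module):
    """Bidirectional LSTM mapping an input sequence to articulatory trajectories."""
    def __init__(self, in_dim, hidden=128, out_dim=ARTIC_DIM):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.proj(h)                # (batch, frames, out_dim)

# Stage 1: train the teacher on the small EMA corpus, mapping frame-level
# phone features to measured EMA trajectories (fitting loop omitted).
teacher = BLSTMRegressor(PHONE_DIM)
# ... fit `teacher` with an MSE loss against EMA measurements ...

# Stage 2: distil into a student that needs only acoustic features.
student = BLSTMRegressor(ACOUSTIC_DIM)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

def distillation_step(acoustic, phone_feats):
    """One training step: the student mimics the teacher's articulatory estimates."""
    with torch.no_grad():
        target = teacher(phone_feats)      # teacher's articulatory estimate
    loss = mse(student(acoustic), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch showing the expected shapes.
B, T = 4, 200
loss = distillation_step(torch.randn(B, T, ACOUSTIC_DIM), torch.randn(B, T, PHONE_DIM))

At recognition time, the student's output would simply be concatenated with the acoustic features and fed to the phone recognizer; the phone-feature input and the teacher are no longer needed.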
