A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion

The challenge of articulatory inversion is to determine the temporal movement of the articulators from the speech waveform, or from acoustic-phonetic knowledge, e.g. derived from information about the linguistic content of the utterance. The actual positions of the articulators are typically obtained from measured data, in our case position measurements obtained using electromagnetic articulography (EMA). In this paper, we investigate the impact on the articulatory inversion problem of using features derived from the acoustic waveform relative to using linguistic features derived from the time-aligned phone sequence of the utterance. Filterbank energies (FBE) are used as acoustic features, while phoneme identities and (binary) phonetic attributes are used as linguistic features. Experiments are performed on a speech corpus with synchronously recorded EMA measurements, employing a bidirectional long short-term memory (BLSTM) network that estimates the articulators' positions. Acoustic FBE features performed better for vowel sounds, while phonetic features attained better results for nasal and fricative sounds, with the exception of /h/. Further improvements were obtained by combining FBE and linguistic features, which led to an average relative RMSE reduction of 9.8% and a 3% relative improvement of the Pearson correlation coefficient.
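For illustration, the sketch below shows one plausible way to set up such a BLSTM regression from frame-level input features (acoustic FBEs concatenated with linguistic features) to EMA trajectories, together with the RMSE and Pearson-correlation measures used for evaluation. The feature dimensions, layer sizes, and the use of TensorFlow/Keras are assumptions made for the example, not the paper's exact configuration.

# Minimal sketch (assumed dimensions and toolkit, not the authors' exact setup):
# a BLSTM mapping per-frame input features to EMA articulator positions.
import numpy as np
import tensorflow as tf

N_FBE = 40        # filterbank energies per frame (assumed)
N_PHONETIC = 60   # phone-identity / binary phonetic-attribute features per frame (assumed)
N_EMA = 12        # EMA articulator position channels to predict (assumed)

def build_blstm(input_dim, output_dim):
    """Bidirectional LSTM regression from feature sequences to EMA trajectories."""
    inputs = tf.keras.Input(shape=(None, input_dim))   # variable-length utterances
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)
    outputs = tf.keras.layers.Dense(output_dim)(x)      # per-frame EMA estimate
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Combined-feature model: acoustic FBEs concatenated with linguistic features per frame.
model = build_blstm(N_FBE + N_PHONETIC, N_EMA)

def rmse(y_true, y_pred):
    """Root-mean-square error over all frames and EMA channels."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_pearson(y_true, y_pred):
    """Pearson correlation averaged over EMA channels (frames x channels arrays)."""
    corrs = [np.corrcoef(y_true[:, k], y_pred[:, k])[0, 1]
             for k in range(y_true.shape[1])]
    return float(np.mean(corrs))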
