Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals

We propose a new acoustic-to-articulatory inversion (AAI) sequence-to-sequence neural architecture, where spectral sub-bands are independently processed in time by 1-dimensional (1-D) convolutional filters of different sizes. The learned feature maps are then combined and processed by a recurrent block with bi-directional long short-term memory (BLSTM) gates, which preserves the smoothly varying nature of the articulatory trajectories. Our experimental evidence shows that, on a speaker-dependent AAI task, despite its reduced number of parameters, our model achieves a lower root mean squared error (RMSE) and a higher Pearson's correlation coefficient (PCC) than both a BLSTM model and an FC-BLSTM model whose first stages are fully connected layers. In particular, the average RMSE drops from 1.401 when feeding the filterbank features directly into the BLSTM, to 1.328 with the FC-BLSTM model, and to 1.216 with the proposed method. Similarly, the average PCC increases from 0.859 to 0.877 and 0.895, respectively. On a speaker-independent AAI task, we show that our convolutional features outperform the original filterbank features and can be combined with phonetic features, which bring independent information to the solution of the problem. To the best of the authors' knowledge, these are the best results reported on this task and data.
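To make the described architecture concrete, the following is a minimal Keras sketch of a sub-band time-convolution front end feeding a BLSTM stack. All dimensions here (number of filterbank channels, number of sub-bands, per-band kernel widths, hidden sizes, and number of output trajectories) are illustrative assumptions, not the paper's actual hyper-parameters.

```python
from tensorflow.keras import layers, Model

# Assumed dimensions: 40 filterbank channels split into 4 sub-bands of
# 10 channels each; 12 articulatory trajectories as regression targets.
N_MELS, N_BANDS, N_ARTIC = 40, 4, 12
BAND_SIZE = N_MELS // N_BANDS
KERNELS = [3, 5, 7, 9]  # assumed kernel widths, one per sub-band

inputs = layers.Input(shape=(None, N_MELS))  # (time, filterbank channels)

# Process each frequency sub-band independently with a 1-D convolution
# over time; the kernel size differs from band to band.
band_maps = []
for b in range(N_BANDS):
    band = layers.Lambda(
        lambda x, lo=b * BAND_SIZE, hi=(b + 1) * BAND_SIZE: x[:, :, lo:hi]
    )(inputs)
    band_maps.append(
        layers.Conv1D(32, KERNELS[b], padding="same", activation="relu")(band)
    )

# Combine the learned feature maps and model the smoothly varying
# articulatory trajectories with a bi-directional LSTM block.
merged = layers.Concatenate(axis=-1)(band_maps)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(merged)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(N_ARTIC))(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")  # RMSE-style regression objective
```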
