A deep recurrent approach for acoustic-to-articulatory inversion

To solve the acoustic-to-articulatory inversion problem, this paper proposes a deep bidirectional long short-term memory recurrent neural network and a deep recurrent mixture density network. The articulatory parameters of the current frame may be correlated with acoustic features many frames earlier or later. A traditional, pre-designed fixed-length context window may be either too short to cover this correlation or longer than necessary. The advantage of a recurrent neural network is that it learns the appropriate context on its own, without an externally specified context window. Experimental results indicate that the recurrent model produces more accurate predictions for acoustic-to-articulatory inversion than a deep neural network with a fixed-length context window. Furthermore, the articulatory trajectories predicted by the recurrent neural network are smooth. An average root mean square error of 0.816 mm is achieved on the MNGU0 test set without any post-filtering, which is state-of-the-art inversion accuracy.
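The contrast between a fixed-length context window and recurrent context can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: a plain tanh recurrence stands in for the LSTM cells, the weights are supplied by the caller rather than trained, and the names `stack_context`, `rnn_pass`, `bidirectional_states`, and `rmse` are hypothetical. The `rmse` helper shows one common definition of root mean square error; the exact averaging used for the reported figure is not specified here.

```python
import numpy as np


def stack_context(frames, left, right):
    """Fixed-length window: join each frame with `left` past and `right` future frames.

    Edges are padded by repeating the first and last frame (an assumption).
    """
    first = np.repeat(frames[:1], left, axis=0)
    last = np.repeat(frames[-1:], right, axis=0)
    padded = np.concatenate([first, frames, last], axis=0)
    width = left + 1 + right
    return np.stack([padded[t:t + width].ravel() for t in range(len(frames))])


def rnn_pass(frames, w_in, w_rec):
    """Simple tanh recurrence (stand-in for an LSTM layer), one state per frame."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x in frames:
        h = np.tanh(x @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)


def bidirectional_states(frames, params_fwd, params_bwd):
    """Concatenate a forward pass and a time-reversed backward pass.

    The state at frame t then depends on every frame in the utterance,
    with no window length chosen in advance.
    """
    fwd = rnn_pass(frames, *params_fwd)
    bwd = rnn_pass(frames[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)


def rmse(pred, target):
    """Root mean square error over all frames and channels."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```

With a window of two frames on each side, the feature vector for the first frame cannot see a change in the last frame of a longer utterance, whereas the backward half of the bidirectional state can.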
