Paper Information - Low Resource Acoustic-to-articulatory Inversion Using Bi-directional Long Short Term Memory

Estimating articulatory movements from speech acoustic features is known as acoustic-to-articulatory inversion (AAI). Training an AAI model in a subject-dependent manner, referred to as subject-dependent AAI (SD-AAI), requires a large amount of parallel speech and articulatory motion data. Electromagnetic articulography (EMA) is a promising technology for recording such parallel data, but it is expensive, time-consuming, and tiring for the subject. To reduce the demand for parallel acoustic-articulatory data from a given subject, we propose a subject-adaptive AAI method (SA-AAI) that builds on an existing AAI model trained with a large amount of parallel data from a fixed set of subjects. Experiments are performed with acoustic-articulatory data from 30 subjects, with the AAI models trained using a bidirectional long short-term memory (BLSTM) network, to examine how much data from a new target subject SA-AAI needs in order to match the performance of SD-AAI. Experimental results reveal that the proposed SA-AAI performs similarly to SD-AAI with ∼62.5% less training data. Among the different articulators, the SA-AAI performance for the tongue articulators matches the corresponding SD-AAI performance with only ∼12.5% of the data used for SD-AAI training.
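The two percentages in the abstract are stated differently: "∼62.5% less" is a reduction, while "∼12.5% of" is a remaining share. The short sketch below works through that arithmetic. The SD-AAI training-set size of 400 utterances and the helper name `utterances_needed` are assumptions chosen only for illustration; they do not come from the paper.

```python
# Hypothetical size of the training set used for a subject-dependent
# (SD-AAI) model. This number is an assumption for illustration only.
SD_AAI_UTTERANCES = 400


def utterances_needed(fraction_of_sd_data: float,
                      sd_total: int = SD_AAI_UTTERANCES) -> int:
    """Return how many target-subject utterances a given fraction
    of the SD-AAI training set corresponds to."""
    return round(fraction_of_sd_data * sd_total)


# "~62.5% less training data" leaves about 37.5% of the SD-AAI data.
overall_fraction = 1.0 - 0.625

# For the tongue articulators, about 12.5% of the SD-AAI data is reported
# to be enough for SA-AAI to match SD-AAI.
tongue_fraction = 0.125

overall_utts = utterances_needed(overall_fraction)
tongue_utts = utterances_needed(tongue_fraction)

print(f"All articulators: {overall_utts} of {SD_AAI_UTTERANCES} utterances")
print(f"Tongue articulators: {tongue_utts} of {SD_AAI_UTTERANCES} utterances")
```

Under this assumed set size, the overall result corresponds to 150 utterances from the new subject and the tongue-articulator result to 50.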

Prasanta Kumar Ghosh | Aravind Illa
