Sparse smoothing of articulatory features from Gaussian mixture model based acoustic-to-articulatory inversion: benefit to speech recognition

Speech recognition using articulatory features estimated by Acoustic-to-Articulatory Inversion (AAI) is considered. A recently proposed sparse smoothing approach is used to post-process the estimates obtained from Gaussian Mixture Model (GMM) based AAI under the Minimum Mean Squared Error (MMSE) criterion. It is well known that low-pass smoothing as a post-processing step improves AAI performance. Sparse smoothing, on the other hand, not only improves AAI performance but also preserves MMSE optimality for as many estimates as possible. In this work, we investigate the benefit of preserving MMSE optimality during post-processing by using the smoothed articulatory estimates in a broad-class phonetic recognition task. Experimental results show that low-pass-filter-based smoothing causes a significant drop in recognition accuracy compared to using articulatory estimates without any smoothing. In contrast, the recognition accuracy obtained with articulatory features from sparse smoothing is similar to that obtained with articulatory features taken directly from GMM-based AAI without any post-processing. Thus, sparse smoothing is beneficial both in terms of inversion performance and recognition accuracy, whereas low-pass smoothing is not.
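
The sketch below illustrates the pipeline described above in Python/NumPy: a per-frame MMSE articulatory estimate from a joint acoustic-articulatory GMM, a conventional low-pass post-processing filter, and a sparse smoother. The sparse smoother is written as an ISTA solver for an l1-penalized correction term, which is only an assumed stand-in for the recently proposed sparse smoothing objective; all function names, parameter values (cutoff_hz, lam, n_iter) and the objective itself are illustrative rather than taken from the paper.

    import numpy as np
    from scipy.signal import butter, filtfilt
    from scipy.stats import multivariate_normal

    def gmm_mmse_estimate(weights, means, covs, x_ac, d_ac):
        # Per-frame MMSE articulatory estimate from a joint GMM over concatenated
        # [acoustic; articulatory] vectors (mixture-of-conditional-Gaussians regression).
        # d_ac is the acoustic dimensionality; the parameter layout is an assumption.
        resp, cond_means = [], []
        for w, mu, S in zip(weights, means, covs):
            mu_a, mu_y = mu[:d_ac], mu[d_ac:]
            S_aa, S_ya = S[:d_ac, :d_ac], S[d_ac:, :d_ac]
            resp.append(w * multivariate_normal.pdf(x_ac, mean=mu_a, cov=S_aa))
            cond_means.append(mu_y + S_ya @ np.linalg.solve(S_aa, x_ac - mu_a))
        resp = np.asarray(resp) / np.sum(resp)           # component posteriors p(k | x_ac)
        return resp @ np.asarray(cond_means)             # posterior-weighted conditional means

    def lowpass_smooth(traj, cutoff_hz=10.0, frame_rate_hz=100.0, order=4):
        # Zero-phase Butterworth low-pass filtering of one articulatory channel.
        b, a = butter(order, cutoff_hz / (0.5 * frame_rate_hz))
        return filtfilt(b, a, traj)

    def sparse_smooth(traj, lam=5.0, n_iter=500):
        # Sparse smoothing written as ISTA on an assumed objective:
        #     min_c  lam * || D (traj + c) ||_2^2  +  || c ||_1
        # where D is the second-difference (roughness) operator and c is a sparse
        # correction, so most frames keep their MMSE-optimal values.
        T = len(traj)
        D = np.diff(np.eye(T), n=2, axis=0)              # (T-2) x T second-difference matrix
        DtD = D.T @ D
        step = 1.0 / (2.0 * lam * np.linalg.norm(DtD, 2))    # 1 / Lipschitz constant
        c = np.zeros(T)
        for _ in range(n_iter):
            grad = 2.0 * lam * DtD @ (traj + c)          # gradient of the quadratic roughness term
            z = c - step * grad
            c = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)   # soft threshold (prox of l1)
        return traj + c

Because the l1 penalty drives most entries of the correction to exactly zero, the majority of frames retain their MMSE-optimal values, which is the property whose impact on recognition accuracy is examined in this work; a low-pass filter, by contrast, perturbs every frame.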
