BOOSTING AUTOMATIC SPEECH RECOGNITION THROUGH ARTICULATORY INVERSION

This paper explores whether articulatory features predicted from speech acoustics through acoustic-to-articulatory inversion can be used to boost the recognition of context-dependent units when combined with acoustic features. For this purpose, we performed articulatory inversion on a corpus containing acoustic and electromagnetic articulography recordings from a single speaker. We then compared the performance of an HMM-based diphone classifier on the individual feature sets (acoustic, ground-truth articulatory, and inversion-predicted articulatory) as well as on their combinations. To make good use of the limited corpus, we adopted a factorized representation that first classified diphones into broad, overlapping categories and then combined the category decisions using a maximum-a-posteriori criterion. When comparing the individual feature sets, our results show no degradation in classification performance when inversion-predicted articulatory features are used in place of ground-truth measurements. Furthermore, performance on the acoustic feature set improved by 10% when ground-truth articulatory features were added, and by 5% when inversion-predicted features were added.
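
To illustrate the kind of maximum-a-posteriori combination the abstract describes, the sketch below combines per-factor broad-class posteriors into a diphone decision. This is a minimal illustration, not the paper's implementation: the factor tables, diphone labels, posteriors, and the map_combine helper are hypothetical, and the factors are assumed to be conditionally independent.

import math

# Hypothetical factorization: each table maps a diphone label to its broad
# class under one factor (e.g., manner or place of the flanking phones);
# classes may overlap across factors.
FACTORS = [
    {"aa-b": "vowel-stop", "aa-s": "vowel-fricative", "b-aa": "stop-vowel"},
    {"aa-b": "back-labial", "aa-s": "back-alveolar", "b-aa": "labial-back"},
]

def map_combine(diphones, factor_posteriors):
    """Pick the MAP diphone from per-factor broad-class posteriors.

    factor_posteriors[k] maps each broad class of factor k to
    P(class | observation). Assuming the factors are independent, the MAP
    diphone maximizes the product of its class posteriors (sum of logs).
    """
    best, best_score = None, -math.inf
    for d in diphones:
        score = sum(math.log(factor_posteriors[k][FACTORS[k][d]])
                    for k in range(len(FACTORS)))
        if score > best_score:
            best, best_score = d, score
    return best

# Toy posteriors from two broad-class classifiers for one observation.
posteriors = [
    {"vowel-stop": 0.6, "vowel-fricative": 0.3, "stop-vowel": 0.1},
    {"back-labial": 0.5, "back-alveolar": 0.4, "labial-back": 0.1},
]
print(map_combine(["aa-b", "aa-s", "b-aa"], posteriors))  # -> aa-b

Decomposing the decision this way lets each broad-class classifier see far more training examples per class than a monolithic diphone classifier would, which is why such a factorization helps on a limited corpus.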
