Vowel recognition from articulatory position time-series data

A new approach to recognizing vowels from articulatory position time-series data was proposed and tested in this paper. The approach mapped articulatory position time-series data directly to vowels, without extracting articulatory features such as mouth opening. The input time-series data were time-normalized and sampled into fixed-width vectors of articulatory positions. Three commonly used classifiers (Neural Network, Support Vector Machine, and Decision Tree) were trained on these vectors and their performance compared. A single-speaker dataset of eight major English vowels, acquired with an Electromagnetic Articulograph (EMA) AG500, was used. Recognition rates under cross-validation ranged from 76.07% to 91.32% across the three classifiers. In addition, the trained decision trees were consistent with the articulatory features commonly used to distinguish vowels descriptively in classical phonetics. The findings are intended to improve the accuracy and response time of a real-time articulatory-to-acoustics synthesizer.
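As an illustration of the pipeline described above, the sketch below time-normalizes variable-length articulatory trajectories into fixed-width vectors and cross-validates the three classifier types. It is a minimal sketch, not the paper's implementation: the scikit-learn estimators, the 10-point sampling width, and the linear-interpolation resampling are assumptions introduced here for illustration.

```python
# Minimal sketch (assumed, not the paper's code): time-normalize variable-length
# articulatory trajectories to fixed-width vectors, then compare a neural network,
# an SVM, and a decision tree with cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

N_SAMPLES = 10  # assumed number of time-normalized sample points per sensor channel

def to_fixed_width(trajectory):
    """Resample one utterance (frames x channels) to N_SAMPLES points per channel
    by linear interpolation over normalized time, then flatten to a single vector."""
    frames, channels = trajectory.shape
    t_orig = np.linspace(0.0, 1.0, frames)   # original frame times, normalized to [0, 1]
    t_new = np.linspace(0.0, 1.0, N_SAMPLES)  # fixed sampling grid
    resampled = np.column_stack(
        [np.interp(t_new, t_orig, trajectory[:, c]) for c in range(channels)]
    )
    return resampled.ravel()

def compare_classifiers(utterances, labels, folds=5):
    """Cross-validate the three classifier types on the fixed-width vectors."""
    X = np.vstack([to_fixed_width(u) for u in utterances])
    y = np.asarray(labels)
    models = {
        "neural network": MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000),
        "SVM": SVC(kernel="rbf"),
        "decision tree": DecisionTreeClassifier(),
    }
    return {name: cross_val_score(m, X, y, cv=folds).mean() for name, m in models.items()}
```

A trained decision tree from such a setup can also be inspected directly, since its splits are thresholds on individual sensor coordinates; this is the kind of interpretability the paper relates to classical phonetic vowel features.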
