Statistical multi-stream modeling of real-time MRI articulatory speech data

This paper investigates different statistical modeling frameworks for articulatory speech data obtained using real-time (RT) magnetic resonance imaging (MRI). To quantitatively capture the spatio-temporal shaping process of the human vocal tract during speech production, a multi-dimensional stream of direct image features is extracted automatically from the MRI recordings. The features are closely related, though not identical, to the tract variables commonly defined in articulatory phonology. The modeling of the shaping process aims to decompose the articulatory data streams into primitives by segmentation. A variety of approaches to the segmentation task are investigated, including vector quantizers, Gaussian Mixture Models, Hidden Markov Models, and a coupled Hidden Markov Model. We evaluate the performance of the different segmentation schemes qualitatively with the help of a well-understood data set that was used in an earlier study of inter-articulatory timing phenomena of American English nasal sounds.

Index Terms: speech production, articulatory modeling, real-time magnetic resonance imaging
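As a rough illustration of the simplest approach listed above, the following sketch segments a multi-dimensional feature stream with a k-means vector quantizer: each frame is assigned to its nearest codeword, and a segment boundary is placed wherever the assigned codeword changes. This is not the implementation used in the paper. The function name `vq_segment`, its parameters, and the synthetic two-state stream are all assumptions made for illustration, and real image features would replace the synthetic data.

```python
import numpy as np


def vq_segment(features, n_codes=2, n_iter=20, seed=0):
    """Hypothetical helper: k-means vector quantization of a feature stream.

    features: array of shape (n_frames, n_dims), one row per MRI frame.
    Returns per-frame codeword labels and the frame indices where the
    label changes (candidate segment boundaries).
    """
    rng = np.random.default_rng(seed)
    # Initialize the codebook from distinct, randomly chosen frames.
    init = rng.choice(len(features), size=n_codes, replace=False)
    codebook = features[init].copy()

    for _ in range(n_iter):
        # Distance of every frame to every codeword: (n_frames, n_codes).
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of the frames assigned to it.
        for k in range(n_codes):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)

    # Final assignment with the converged codebook.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # A boundary is any frame whose label differs from the previous frame's.
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    return labels, boundaries


# Synthetic 3-dimensional stream (assumed data): 50 frames in one
# articulatory configuration followed by 50 frames in another.
rng = np.random.default_rng(1)
stream = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 3)),
    rng.normal(10.0, 0.1, size=(50, 3)),
])
labels, boundaries = vq_segment(stream, n_codes=2)
```

A vector quantizer of this kind treats frames independently and ignores temporal order. Hidden Markov Models add explicit state-transition structure, and a coupled model additionally links the state sequences of separate feature streams.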
