Estimation of the air-tissue boundaries of the vocal tract in the mid-sagittal plane from electromagnetic articulograph data

An electromagnetic articulograph (EMA) records the movement of sensors attached to a few flesh points on speech articulators, including the lips, jaw, and tongue, while a subject speaks. In this work, we quantify how much information these flesh points provide about the vocal tract (VT) shape in the mid-sagittal plane. The VT shape is described by the air-tissue boundaries (ATBs), which are obtained manually from real-time magnetic resonance imaging (rtMRI) recordings of a set of utterances spoken by a subject for whom EMA recordings of the same utterances are also available. We propose a two-stage approach for reconstructing the VT shape from the EMA data. The first stage co-registers the EMA data with the VT shape from the rtMRI frames. The second stage estimates the ATBs from the co-registered EMA points. Co-registration is done by a spatio-temporal alignment of the VT shapes from the rtMRI frames and the EMA sensor data, while a radial basis function (RBF) network is used to estimate the ATBs. Experiments with the EMA and rtMRI recordings of five sentences spoken by one male and one female speaker show that the VT shape in the mid-sagittal plane can be recovered from the EMA flesh points with average reconstruction errors of 2.55 mm and 2.75 mm, respectively.
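As an illustration of the second stage only, the following is a minimal sketch of a Gaussian RBF network that maps co-registered sensor coordinates to boundary points and reports a mean point-wise error. It is not the authors' implementation: the abstract does not specify the centres, kernel width, training procedure, number of sensors, or number of boundary points, so all of those, the function names, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def design_matrix(X, centers, width):
    """Gaussian RBF activations for each sample, plus a constant bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * width ** 2))
    return np.hstack([Phi, np.ones((X.shape[0], 1))])

def fit_rbf(X, Y, centers, width, ridge=1e-3):
    """Fit the output weights by ridge-regularised least squares (an assumed training rule)."""
    Phi = design_matrix(X, centers, width)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ Y)

def predict_rbf(X, centers, width, W):
    """Predict flattened (x, y) boundary coordinates from sensor coordinates."""
    return design_matrix(X, centers, width) @ W

def mean_point_error(Y_true, Y_pred):
    """Mean Euclidean distance between corresponding boundary points."""
    diff = (Y_true - Y_pred).reshape(len(Y_true), -1, 2)
    return np.linalg.norm(diff, axis=2).mean()

# Synthetic stand-in data (assumption): 6 sensors and 20 boundary points per frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))            # 200 frames, 6 sensors x (x, y)
M = rng.normal(size=(12, 40))
Y = 10.0 * np.tanh(0.2 * (X @ M))         # 20 boundary points x (x, y), arbitrary units

centers = X[:30]                          # assumed: centres taken from training frames
width = 5.0                               # assumed kernel width
W = fit_rbf(X, Y, centers, width)
Y_hat = predict_rbf(X, centers, width, W)
train_error = mean_point_error(Y, Y_hat)
```

With real data, `X` would hold the co-registered EMA sensor positions and `Y` the manually obtained boundary points for the matching rtMRI frames, and the error would be computed on held-out frames rather than on the training set as done here.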
