Towards a Method of Dynamic Vocal Tract Shapes Generation by Combining Static 3D and Dynamic 2D MRI Speech Data

We present an algorithm for augmenting the shape of the vocal tract using 3D static and 2D dynamic speech MRI data. Static 3D images have better resolution and provide spatial information, whereas 2D dynamic images capture the transitions. The aim of this work is to combine the strengths of these two types of data in order to improve the image quality of the 2D dynamic images and to extend them to the 3D domain. To produce a 3D dynamic consonant-vowel (CV) sequence, our algorithm takes as input the 2D CV transition and the static 3D targets for C and V. To obtain the enhanced sequence of images, the first step is to find a transformation between the 2D images and the mid-sagittal slice of the acoustically corresponding 3D image stack; the second is to find a transformation between neighbouring sagittal slices in the 3D static image stack. Combining these transformations produces the final set of images. In the present study, we first examined the transformation from the 3D mid-sagittal frame to the 2D video in order to improve image quality, and then examined the extension of the 2D video to the third dimension in order to enrich spatial information.
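The combination of two image transformations can be illustrated with a minimal sketch. It assumes, purely for illustration, that each transformation is stored as a dense per-pixel displacement field of shape (2, H, W) and applied by backward warping with nearest-neighbour sampling; the abstract does not specify the representation, the interpolation, or the order of composition, and the names `warp` and `compose` are hypothetical.

```python
import numpy as np

def warp(image, disp):
    """Backward-warp a 2D array: out[y, x] = image[y + disp[0], x + disp[1]].

    Nearest-neighbour sampling with edge clamping (an illustrative choice).
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + disp[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + disp[1]).astype(int), 0, w - 1)
    return image[src_y, src_x]

def compose(disp_first, disp_second):
    """Single field equivalent to warp(warp(image, disp_first), disp_second).

    For a pixel p the combined displacement is d2(p) + d1(p + d2(p)),
    so the first field is resampled at the positions given by the second.
    """
    resampled_first = np.stack([
        warp(disp_first[0], disp_second),
        warp(disp_first[1], disp_second),
    ])
    return disp_second + resampled_first

# Toy data standing in for an MRI slice and two estimated transformations.
image = np.arange(30).reshape(5, 6)
shift_down = np.zeros((2, 5, 6))
shift_down[0] = 1.0   # hypothetical first transformation: 1 pixel along y
shift_right = np.zeros((2, 5, 6))
shift_right[1] = 2.0  # hypothetical second transformation: 2 pixels along x

two_steps = warp(warp(image, shift_down), shift_right)
one_step = warp(image, compose(shift_down, shift_right))
```

In this toy case `two_steps` and `one_step` are identical, which is the property that allows a chain of registrations to be applied as one resampling. With real, non-integer displacement fields a smoother interpolation would normally replace the nearest-neighbour lookup.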
