Articulatory Synthesis Based on Real-Time Magnetic Resonance Imaging Data

This paper presents a methodology for articulatory synthesis of running speech in American English, driven by real-time magnetic resonance imaging (rtMRI) data of the mid-sagittal vocal tract. At the core of the methodology is a time-domain simulation of sound propagation in the vocal tract, developed previously by Maeda. The first step is the automatic derivation of air-tissue boundaries from the rtMRI data. These articulatory outlines are then modified systematically to make the formation of consonantal vocal-tract constrictions more precise. Other elements of the methodology include a previously reported set of empirical rules for setting the time-varying characteristics of the glottis and the velopharyngeal port, and a revised sagittal-to-area conversion. The results are a promising step toward a full-fledged text-to-speech synthesis system that directly leverages observed vocal-tract dynamics.
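
To illustrate what a sagittal-to-area conversion does, the following minimal sketch maps mid-sagittal distances to cross-sectional areas with the classical power-law form A = alpha * d^beta. This is not the revised conversion used in the paper, whose details are not given in this abstract. The function name `sagittal_to_area` and the coefficient values are hypothetical placeholders chosen for illustration only.

```python
def sagittal_to_area(distances_cm, alpha=1.5, beta=1.4):
    """Convert mid-sagittal distances (cm) to cross-sectional areas (cm^2).

    Uses the power-law form A = alpha * d**beta. The default alpha and beta
    are illustrative assumptions, not values taken from the paper.
    """
    areas = []
    for d in distances_cm:
        if d < 0:
            raise ValueError("mid-sagittal distances must be non-negative")
        areas.append(alpha * d ** beta)
    return areas


# Example: distances sampled along the vocal-tract midline (hypothetical values).
distances = [0.0, 0.5, 1.0, 2.0]
areas = sagittal_to_area(distances)
```

In published power-law models the coefficients typically vary along the vocal tract, so a fuller version would take per-section values of `alpha` and `beta` rather than a single pair.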
