Can static vocal tract positions represent articulatory targets in continuous speech? Matching static MRI captures against real-time MRI for the French language

This paper uses midsagittal slices from a static magnetic resonance imaging (MRI) dataset that captures blocked articulations of vowels, and of consonants anticipating /a, i, u, y/ and a variety of other vowels, to study whether these deliberately held articulatory targets appear, and how distinctly, in real-time MRI recordings. The study investigates whether such articulatory targets are actually attained in fluent speech, how clearly marked they are, and what factors influence the degree of similarity between a given articulatory target and the actual vocal tract shape. To quantify this similarity, we use the structural similarity index, the Wasserstein distance, and a SIFT-based measure. We analyze the amplitude and timing of the observed similarity peaks across phonetic classes and speech styles (spontaneous versus non-spontaneous). We show that although real-time speech passes through shapes quite similar to the static captures, there is considerable intra- and inter-speaker variability.
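To illustrate how such frame-by-frame comparisons could be computed, the sketch below scores a static MRI target against real-time MRI frames with the three measures named above. This is a minimal sketch under our own assumptions, not the paper's pipeline: the function names are hypothetical, inputs are assumed to be 8-bit grayscale arrays of identical shape, and the Wasserstein term is a cheap 1D proxy over intensity histograms rather than a full 2D earth mover's distance over pixel positions.

```python
# Minimal sketch (assumed helper names, not the paper's code) of three
# similarity measures between a static MRI target and real-time MRI frames.
# Inputs are assumed to be 8-bit grayscale numpy arrays of identical shape.

import numpy as np
import cv2
from skimage.metrics import structural_similarity
from scipy.stats import wasserstein_distance
from scipy.signal import find_peaks

def ssim_score(target, frame):
    """Structural similarity index in [-1, 1]; 1 = identical structure."""
    return structural_similarity(target, frame, data_range=255)

def wasserstein_score(target, frame, bins=64):
    """1D Wasserstein distance between intensity histograms (a cheap proxy;
    a 2D earth mover's distance over pixel positions would be closer to the
    measure used for image retrieval, but far costlier)."""
    h1, edges = np.histogram(target, bins=bins, range=(0, 255), density=True)
    h2, _ = np.histogram(frame, bins=bins, range=(0, 255), density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return wasserstein_distance(centers, centers, u_weights=h1, v_weights=h2)

def sift_score(target, frame, ratio=0.75):
    """Count of SIFT keypoint matches surviving Lowe's ratio test; a higher
    count means the two vocal tract shapes share more local structure."""
    sift = cv2.SIFT_create()
    _, d1 = sift.detectAndCompute(target, None)
    _, d2 = sift.detectAndCompute(frame, None)
    if d1 is None or d2 is None:
        return 0
    pairs = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def similarity_peaks(target, frames, min_separation=5):
    """Frame indices where SSIM to the target peaks, i.e. candidate moments
    at which the articulatory target is (most nearly) attained."""
    curve = np.array([ssim_score(target, f) for f in frames])
    peaks, _ = find_peaks(curve, distance=min_separation)
    return peaks, curve
```

Under these assumptions, the heights and timings of the peaks returned by a function like similarity_peaks would be the quantities compared across phonetic classes and speech styles.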
