Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV

In this work we present an algorithm for synthesising pseudo real-time MRI (rtMRI) data of the vocal tract. Midsagittal rtMRI data are used to synthesise a target consonant-vowel (CV) sequence using only a silence frame of the target speaker. For this purpose, several single-speaker models were created. The inputs of the algorithm are a silence frame from both the training speaker and the target speaker, together with the training speaker's rtMRI data of the target CV. An image transformation is computed from each CV frame to the next, yielding a set of transformations that describes the dynamics of the CV production. Another image transformation, computed from the silence frame of the training speaker to that of the target speaker, is used to adapt this set of transformations to the target speaker. The adapted transformations are then applied to the silence frame of the target speaker to synthesise pseudo rtMRI data of that speaker's CV. Synthesised images from multiple single-speaker models are frame-aligned and then averaged to produce the final synthesised images. The synthesised images are compared with the original ones using image cross-correlation. The results show good agreement between the synthesised and the original images.
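The last two steps, averaging the frame-aligned outputs of several single-speaker models and scoring the result against the original frames, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state which form of cross-correlation is used, so the zero-normalised, zero-lag variant below is an assumption, as are the function names `average_models` and `normalised_cross_correlation` and the random toy data standing in for rtMRI frames.

```python
import numpy as np


def average_models(per_model_frames):
    """Average frame-aligned sequences from several models.

    Expects an array-like of shape (models, frames, height, width) and
    returns the per-frame mean image, shape (frames, height, width).
    """
    stack = np.asarray(per_model_frames, dtype=float)
    return stack.mean(axis=0)


def normalised_cross_correlation(a, b):
    """Zero-normalised cross-correlation of two equally sized images at zero lag.

    Assumption: the paper's exact similarity measure is not given in the
    abstract; this is one common definition, with values in [-1, 1].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:
        # At least one image is constant, so the correlation is undefined;
        # report 0.0 rather than dividing by zero.
        return 0.0
    return float((a * b).sum() / denom)


# Toy data (hypothetical): 5 "original" frames of 16x16 pixels, and the
# outputs of 3 models simulated as the originals plus small Gaussian noise.
rng = np.random.default_rng(0)
original = rng.random((5, 16, 16))
noise = rng.normal(scale=0.05, size=(3, 5, 16, 16))
per_model = original[None] + noise

synthesised = average_models(per_model)
scores = [
    normalised_cross_correlation(s, o)
    for s, o in zip(synthesised, original)
]
```

Each entry of `scores` is the similarity between one synthesised frame and its original counterpart. The sketch assumes frame alignment has already been done, so corresponding frames share an index across models.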
