A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

Abstract

We present a multilinear statistical model of the human tongue that captures anatomical and tongue-pose-related shape variations separately. The model is derived from 3D magnetic resonance imaging (MRI) data of 11 speakers sustaining speech-related vocal tract configurations. To extract the model parameters, we use a minimally supervised method based on an image segmentation approach and a template fitting technique. In addition, we use image denoising to deal with possibly corrupt data, reconstruction of palate surface information to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation shows that, by limiting the degrees of freedom for the anatomical and speech-related variations to 5 and 4, respectively, we obtain a model that can reliably register unknown data while avoiding overfitting effects. Furthermore, we show that the model can be used to generate plausible tongue animation by tracking sparse motion capture data.
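The abstract does not spell out how the anatomical (speaker) and tongue pose variations are kept separate. A common way to build such a model is a Tucker decomposition computed with a truncated higher-order SVD (HOSVD), and the sketch below illustrates that idea under stated assumptions. It assumes that every tongue mesh is already in dense vertex correspondence (the kind of output the segmentation and template-fitting steps would provide), that the shapes are stacked into an array of size (n_speakers, n_poses, 3 * n_vertices), and that the speaker and pose modes are truncated to 5 and 4 components to match the numbers in the abstract. Registration here is a simple alternating least-squares fit to a known target shape. The data are random stand-ins, the number of poses (9) and mesh size are arbitrary, and the function names `build_multilinear_model`, `synthesize`, and `register` are hypothetical; this is a generic illustration, not the authors' exact procedure.

```python
import numpy as np


def mode_unfold(tensor, mode):
    """Unfold a 3-way tensor so that its rows are indexed by `mode`."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def build_multilinear_model(shapes, k_speaker, k_pose):
    """Truncated HOSVD-style multilinear shape model (illustrative sketch).

    shapes: (n_speakers, n_poses, d) array with d = 3 * n_vertices.
    Assumption: all meshes are in dense vertex correspondence.
    """
    mean = shapes.mean(axis=(0, 1))
    # Centered data tensor with modes (coordinates, speaker, pose).
    data = np.transpose(shapes - mean, (2, 0, 1))
    # Leading left singular vectors of the speaker- and pose-mode unfoldings.
    u_speaker = np.linalg.svd(mode_unfold(data, 1), full_matrices=False)[0][:, :k_speaker]
    u_pose = np.linalg.svd(mode_unfold(data, 2), full_matrices=False)[0][:, :k_pose]
    # Core tensor: data x_2 u_speaker^T x_3 u_pose^T, shape (d, k_speaker, k_pose).
    core = np.einsum("dsp,sa,pb->dab", data, u_speaker, u_pose)
    return mean, core, u_speaker, u_pose


def synthesize(mean, core, w_speaker, w_pose):
    """Evaluate the model: mean + core x_2 w_speaker x_3 w_pose."""
    return mean + np.einsum("dab,a,b->d", core, w_speaker, w_pose)


def register(mean, core, target, w_pose0, n_iter=50):
    """Fit (w_speaker, w_pose) to a target shape by alternating least squares.

    Returns both weight vectors and the residual norm after each iteration.
    """
    offset = target - mean
    w_pose = np.asarray(w_pose0, dtype=float).copy()
    history = []
    for _ in range(n_iter):
        # Pose weights fixed: the model is linear in the speaker weights.
        a_speaker = np.einsum("dab,b->da", core, w_pose)
        w_speaker = np.linalg.lstsq(a_speaker, offset, rcond=None)[0]
        # Speaker weights fixed: the model is linear in the pose weights.
        a_pose = np.einsum("dab,a->db", core, w_speaker)
        w_pose = np.linalg.lstsq(a_pose, offset, rcond=None)[0]
        history.append(np.linalg.norm(offset - a_pose @ w_pose))
    return w_speaker, w_pose, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: 11 speakers x 9 poses x (20 vertices * 3 coordinates).
    shapes = rng.normal(size=(11, 9, 60))
    mean, core, u_s, u_p = build_multilinear_model(shapes, k_speaker=5, k_pose=4)
    print(core.shape)  # (60, 5, 4)

    # Register a shape generated by the model, starting from a perturbed pose.
    target = synthesize(mean, core, u_s[2], u_p[1])
    w_s, w_p, hist = register(mean, core, target, u_p[1] + 0.1 * rng.normal(size=4))
    print(hist[-1])
```

The alternating scheme relies on the model being linear in each weight vector when the other is held fixed, so every step is an ordinary least-squares solve and the residual cannot increase from one iteration to the next. It may still stall at a poor local solution if the initial pose weights are far off, and fitting to real image or sparse motion capture data would require a different data term than the exact vertex-to-vertex residual used here.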
