Vocal tract shaping of emotional speech

Emotional speech production has previously been studied using flesh-point tracking data in speaker-specific experimental setups. The present study introduces a real-time magnetic resonance imaging database of emotional speech production from 10 speakers and presents articulatory analyses of emotional expression in speech based on this database. Midsagittal vocal tract parameters (midsagittal distances and vocal tract length) were extracted with a two-dimensional grid-line system using image segmentation software. The principal feature analysis technique was then applied to the grid-line distances to identify the locations of major articulatory movement. Results reveal both speaker-dependent and speaker-independent variation patterns. For example, sad speech, a low-arousal emotion, tends to show a smaller opening for low vowels in the front cavity than the high-arousal emotions, and this difference appears more consistently there than in other regions of the vocal tract. Happiness shows a significantly shorter vocal tract length than anger and sadness in most speakers. Further details of speaker-dependent and speaker-independent articulatory variation in emotional expression, and their implications, are described.
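To make the principal feature analysis step concrete, the following is a minimal sketch of how grid-line distances could be reduced to a small set of "major movement" grid lines, following the PCA-plus-clustering formulation of principal feature analysis. The data matrix, the number of grid lines, and the chosen component and feature counts are illustrative assumptions, not values taken from the study.

# Minimal sketch of principal feature analysis (PFA) on midsagittal grid-line
# distances. Assumptions: X is a (frames x grid lines) matrix of distances;
# the counts of components and selected features are arbitrary examples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def principal_feature_analysis(X, n_components, n_features):
    # Project the feature (grid-line) space onto the leading principal components.
    pca = PCA(n_components=n_components).fit(X)
    loadings = pca.components_.T                 # shape: (n_gridlines, n_components)

    # Cluster the loading vectors; each cluster groups grid lines that covary.
    km = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit(loadings)

    # Keep, per cluster, the grid line whose loading vector is closest to the centroid.
    selected = []
    for c in range(n_features):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)

# Hypothetical usage: 80 grid lines tracked over 2000 MRI frames.
X = np.random.default_rng(0).normal(size=(2000, 80))
major_gridlines = principal_feature_analysis(X, n_components=10, n_features=5)
print("Major movement locations (grid-line indices):", major_gridlines)

The selected indices would correspond to the vocal tract regions whose midsagittal distances carry most of the variance, which is the sense in which the study identifies major movement locations.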
