USC-EMO-MRI corpus: An emotional speech production database recorded by real-time magnetic resonance imaging

This paper introduces a new multimodal database of emotional speech production recorded using real-time magnetic resonance imaging. The corpus contains magnetic resonance (MR) videos of five male and five female speakers, together with evaluations of the emotional quality of each sentence-level utterance performed by at least 10 listeners. Both the speakers and the listeners are professional actors and actresses. The MR videos contain image sequences of the entire upper airway in the mid-sagittal plane and synchronized, noise-cancelled speech audio. The stimuli comprise the "Grandfather" passage and seven sentences. A single repetition of the passage and five repetitions of the sentences were recorded for each acted emotion; the four target emotions are anger, happiness, sadness, and neutrality (no emotion). Additionally, one repetition of the Grandfather passage was recorded with a neutral emotion at a fast speaking rate, as opposed to the natural speaking rate used for the rest of the recordings. This paper also includes a preliminary analysis of the MR images illustrating how vocal tract configurations, measured as distances between the inner and outer vocal-tract walls along the tract, vary as a function of emotion.
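As a rough illustration of the kind of measurement described above, the sketch below computes a cross-distance profile between segmented inner and outer vocal-tract contours for a single mid-sagittal frame and averages such profiles per emotion. It is a minimal sketch only: the contour arrays, function names, and nearest-neighbor distance definition are assumptions for illustration, not the corpus's actual measurement procedure.

```python
# Sketch: per-frame vocal-tract cross-distance profile from segmented contours.
# Assumes inner_wall and outer_wall are (N, 2) and (M, 2) arrays of (x, y)
# points (in pixels or mm) traced along the tract from an airway-tissue
# boundary segmentation of one mid-sagittal rt-MRI frame; the corpus's own
# measurement method may differ.
import numpy as np

def tract_distance_profile(inner_wall: np.ndarray, outer_wall: np.ndarray) -> np.ndarray:
    """For each inner-wall point, distance to the closest outer-wall point."""
    # Pairwise Euclidean distances between the two contours: shape (N, M).
    diffs = inner_wall[:, None, :] - outer_wall[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Minimum over the outer contour gives one distance per inner-wall point,
    # i.e., a cross-distance profile ordered along the tract (glottis to lips).
    return dists.min(axis=1)

def emotion_mean_profiles(frames_by_emotion: dict) -> dict:
    """Average per-frame profiles within each emotion for comparison.

    frames_by_emotion maps an emotion label (e.g., 'anger') to a list of
    (inner_wall, outer_wall) contour pairs; profiles are assumed to have the
    same number of points along the tract so they can be averaged directly.
    """
    return {
        emo: np.mean([tract_distance_profile(i, o) for i, o in pairs], axis=0)
        for emo, pairs in frames_by_emotion.items()
    }
```

Comparing such averaged profiles across emotions (e.g., anger versus neutrality) is one simple way to visualize how vocal tract shaping varies with emotional expression.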
