A three-layered model for expressive speech perception

This paper proposes a multi-layer approach to modeling the perception of expressive speech. Many earlier studies of expressive speech focused on statistical correlations between expressive speech and acoustic features without accounting for the fact that human perception is vague rather than precise. This paper introduces a three-layer model: five categories of expressive speech constitute the top layer, semantic primitives the middle layer, and acoustic features the bottom layer. Three experiments followed by multidimensional scaling analysis revealed suitable semantic primitives. Fuzzy inference systems were then built to model the vague relationship between expressive speech and the semantic primitives. Acoustic features related to the F0 contour, duration, power envelope, and spectrum were analyzed, and regression analysis revealed correlations between the semantic primitives and the acoustic features. Parameterized rules based on these results were created to morph neutral utterances into utterances perceived as having different semantic primitives and belonging to different expressive speech categories. Verification experiments showed significant relationships among expressive speech, semantic primitives, and acoustic features.
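
A minimal sketch of the layered mapping described above, assuming hypothetical feature names, membership breakpoints, category labels, and weights (none of these values come from the paper): acoustic features are first mapped to semantic-primitive degrees by simple fuzzy membership functions (bottom to middle layer), and the primitive degrees are then combined into scores for expressive-speech categories (middle to top layer).

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def primitives_from_acoustics(f0_mean_hz, speech_rate_mps, rms_power):
    """Bottom -> middle layer: acoustic features to semantic-primitive degrees.
    Feature names and breakpoints are illustrative placeholders."""
    return {
        "bright": tri(f0_mean_hz, 150.0, 250.0, 350.0),
        "fast":   tri(speech_rate_mps, 4.0, 7.0, 10.0),
        "strong": tri(rms_power, 0.2, 0.6, 1.0),
    }

def categories_from_primitives(prim):
    """Middle -> top layer: weighted combination of primitive degrees into
    expressive-speech category scores. Weights are made-up examples, not
    the paper's regression results."""
    weights = {
        "joy":     {"bright": 0.6,  "fast": 0.3,  "strong": 0.1},
        "anger":   {"bright": 0.1,  "fast": 0.3,  "strong": 0.6},
        "sadness": {"bright": -0.4, "fast": -0.4, "strong": -0.2},
    }
    return {cat: sum(w * prim[p] for p, w in ws.items())
            for cat, ws in weights.items()}

if __name__ == "__main__":
    prim = primitives_from_acoustics(f0_mean_hz=280.0,
                                     speech_rate_mps=8.0,
                                     rms_power=0.7)
    print(prim)                            # semantic-primitive degrees in [0, 1]
    print(categories_from_primitives(prim))  # expressive-category scores
```

In the paper itself the middle-to-bottom mapping is learned by regression and the top-to-middle mapping by fuzzy inference systems; the fixed triangles and linear weights here only illustrate how the three layers chain together.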
