Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space

Abstract

This paper proposes a rule-based voice conversion system for emotion that converts neutral speech to emotional speech, using a dimensional space (valence and arousal) to control the degree of emotion on a continuous scale. We propose an inverse three-layered model with acoustic features as the output at the top layer, semantic primitives at the middle layer, and emotion dimensions as the input at the bottom layer; adaptive network-based fuzzy inference systems (ANFIS) connect the layers and extract the non-linear rules among them. The rules are applied by modifying the acoustic features of neutral speech to create the different types of emotional speech. The prosody-related acoustic features, the F0 contour and the power envelope, are parameterized using the Fujisaki model and a target prediction model, respectively. Perceptual evaluation results show that the intended degree of emotion is perceived well in the dimensional space of valence and arousal.
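To make the F0 parameterization concrete: the Fujisaki model expresses the log-F0 contour as a baseline value Fb plus phrase components (impulse responses to phrase commands) and accent components (step responses to accent commands). The sketch below is a minimal NumPy implementation of contour generation under standard textbook values of the model constants (alpha = 3.0/s, beta = 20.0/s, gamma = 0.9); the command values in the usage example are illustrative assumptions, not parameters reported in the paper. A rule-based conversion of the kind described above would then rescale commands such as Aa, Ap, and Fb before resynthesis.

    import numpy as np

    def Gp(t, alpha=3.0):
        # Phrase component: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0.
        t = np.asarray(t, dtype=float)
        out = np.zeros_like(t)
        m = t >= 0
        out[m] = alpha**2 * t[m] * np.exp(-alpha * t[m])
        return out

    def Ga(t, beta=20.0, gamma=0.9):
        # Accent component: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0.
        t = np.asarray(t, dtype=float)
        out = np.zeros_like(t)
        m = t >= 0
        out[m] = np.minimum(1.0 - (1.0 + beta * t[m]) * np.exp(-beta * t[m]), gamma)
        return out

    def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
        # ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
        #          + sum_j Aa_j * [Ga(t - T1_j) - Ga(t - T2_j)]
        ln_f0 = np.full_like(np.asarray(t, dtype=float), np.log(fb))
        for ap, t0 in phrases:                 # (amplitude, onset time)
            ln_f0 += ap * Gp(t - t0, alpha)
        for aa, t1, t2 in accents:             # (amplitude, onset, offset)
            ln_f0 += aa * (Ga(t - t1, beta, gamma) - Ga(t - t2, beta, gamma))
        return np.exp(ln_f0)                   # back to Hz

    # Hypothetical example: one phrase command and one accent command.
    t = np.linspace(0.0, 2.0, 400)
    f0 = fujisaki_f0(t, fb=120.0, phrases=[(0.5, 0.0)], accents=[(0.4, 0.3, 0.8)])

Because the commands are few and interpretable, a conversion rule learned in the valence-arousal space reduces to a small set of scaling factors on these command amplitudes, which is what makes continuous control of the degree of emotion tractable in this framework.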
