Generation of Emotion Control Vector Using MDS-Based Space Transformation for Expressive Speech Synthesis

In control-vector-based expressive speech synthesis, the emotion/style control vector is defined in a categorical (CAT) emotion space, which makes it difficult for the user to specify the vector precisely enough to synthesize speech with the desired emotion/style. This paper applies the arousal-valence (AV) space to the multiple-regression hidden semi-Markov model (MRHSMM)-based framework for expressive speech synthesis, allowing the user to designate a specific emotion by its arousal and valence values. Multidimensional scaling (MDS) is adopted to project the AV emotion space and the CAT emotion space onto corresponding orthogonal coordinate systems, and a transformation approach is proposed to map the given AV values to an emotion control vector in the CAT emotion space. In the synthesis phase, given the input text and the desired emotion, speech carrying that emotion is generated from the MRHSMMs using the transformed emotion control vector. Experimental results show that the proposed method helps the user easily and precisely specify the desired emotion for expressive speech synthesis.
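
To make the pipeline concrete, the sketch below illustrates the two space-related steps in plain Python: embedding the categorical emotions into a 2-D coordinate system with classical MDS, and mapping a user-specified arousal-valence point to an emotion control vector. The pairwise distance matrix, the inverse-distance weighting, and the regression matrix H are illustrative assumptions only; the paper derives its own transformation between the MDS-aligned AV and CAT spaces rather than the simple weighting used here.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical (metric) MDS: embed items given their pairwise distances.

    dist : (n, n) symmetric matrix of distances between emotion categories
           (e.g., derived from divergences between the acoustic models).
    k    : target dimensionality; 2 matches the arousal-valence plane.
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n        # double-centering matrix
    gram = -0.5 * centering @ (dist ** 2) @ centering  # centered Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)              # ascending eigenvalues
    top = np.argsort(eigval)[::-1][:k]                 # k largest eigenpairs
    return eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))

def av_to_control_vector(av_point, cat_coords):
    """Turn an arousal-valence point into interpolation weights over the
    embedded emotion categories via inverse-distance weighting (a simple
    stand-in for the paper's space transformation)."""
    d = np.linalg.norm(cat_coords - av_point, axis=1)
    w = 1.0 / np.maximum(d, 1e-8)                      # avoid division by zero
    return w / w.sum()

# Hypothetical pairwise distances among four categories
# (e.g., neutral, happy, angry, sad); real values would be estimated
# from the expressive speech corpus.
D = np.array([[0.0, 1.0, 1.2, 0.9],
              [1.0, 0.0, 0.8, 1.5],
              [1.2, 0.8, 0.0, 1.4],
              [0.9, 1.5, 1.4, 0.0]])
cat_coords = classical_mds(D)                          # 2-D CAT coordinates
ctrl = av_to_control_vector(np.array([0.6, 0.4]), cat_coords)

# In the MRHSMM, each state-output mean is a linear function of the control
# vector: mu = H @ [1, ctrl], so ctrl directly shifts the model parameters
# at synthesis time (H below is a stand-in regression matrix).
H = np.random.randn(40, 1 + ctrl.size)
mu = H @ np.concatenate(([1.0], ctrl))
```

Inverse-distance weighting is used here only to keep the sketch self-contained; the key point it shares with the paper is that, once both spaces sit in aligned coordinates, a user-chosen AV point can be converted mechanically into a control vector that drives the MRHSMM means.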
