Emotion transplantation through adaptation in HMM-based speech synthesis

Highlights

- We propose an emotion transplantation method based on adaptation techniques.
- Emotions can be imbued into neutral synthetic speech models regardless of gender.
- Five perceptual evaluations, including one with a robot, were carried out.
- Emotion transplantation clearly improves emotional performance over neutral voices.
- High-quality source models yield high-quality transplanted models.

Abstract

This paper proposes an emotion transplantation method capable of modifying a synthetic speech model through CSMAPLR adaptation, incorporating emotional information learned from a different speaker's model while preserving the identity of the original speaker as much as possible. The proposed method learns both emotional and speaker-identity information, each as an adaptation function from an average voice model, and combines them into a single cascaded transform capable of imbuing the desired emotion into the target speaker's voice. The method is then applied to the task of transplanting four emotions (anger, happiness, sadness, and surprise) into three male and three female speakers, and is evaluated in a number of perceptual tests. The results show that perceived naturalness for emotional text significantly favors the proposed transplanted emotional speech synthesis over traditional neutral speech synthesis, with a large increase in the perceived emotional strength of the synthesized utterances at a slight cost in speech quality. A final evaluation with a robotic laboratory assistant application shows that using emotional speech significantly increases students' satisfaction with the dialog system, demonstrating that the proposed emotion transplantation system provides benefits in real applications.
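The cascading of adaptation transforms described above can be sketched in simplified form. This is only an illustration of CSMAPLR-style affine mean adaptation, not the paper's exact formulation: given a mean vector $\mu$ from the average voice model, a speaker transform $(A_s, b_s)$ and an emotion transform $(A_e, b_e)$, each estimated by adaptation from the average voice, the cascade applies one transform after the other:

```latex
\mu^{\text{emo}}_{\text{target}}
  = A_e \left( A_s \, \mu + b_s \right) + b_e
  = \underbrace{A_e A_s}_{A_c} \, \mu + \underbrace{A_e b_s + b_e}_{b_c}
```

Because both steps are affine, they collapse into a single equivalent transform $(A_c, b_c)$, which is the "single cascade transform" the abstract refers to: it imbues the learned emotion while carrying the target speaker's identity information.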
