A Technique for Estimating Intensity of Emotional Expressions and Speaking Styles in Speech Based on Multiple-Regression HSMM

In this paper, we propose a technique for estimating the degree, or intensity, of emotional expressions and speaking styles appearing in speech. The key idea is based on a style control technique for speech synthesis using a multiple-regression hidden semi-Markov model (MRHSMM), and the proposed technique can be viewed as the inverse of style control. In the proposed technique, the acoustic features of spectrum, power, fundamental frequency, and duration are modeled simultaneously with the MRHSMM. We derive an algorithm, based on a maximum likelihood criterion, for estimating the explanatory variables of the MRHSMM, each of which represents the intensity of an emotional expression or speaking style appearing in the acoustic features of speech. We demonstrate the ability of the proposed technique experimentally using two types of speech data: simulated emotional speech and spontaneous speech with different speaking styles. The estimated values are found to correlate with human perception.
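The core of the estimation step can be illustrated with a minimal sketch. In an MRHSMM, each state's Gaussian mean is a linear function of the explanatory (style) vector, mu_i = H_i [1; v], so for a fixed state alignment the maximum likelihood estimate of v has a closed form obtained by solving a weighted least-squares system. The code below is a hypothetical illustration under these simplifying assumptions (single Gaussian stream, known alignment, covariances held fixed), not the paper's actual implementation; all function and variable names are ours.

```python
import numpy as np

def estimate_style_vector(obs, states, H, Sigma_inv):
    """ML estimate of the style vector v for a fixed state alignment.

    obs:       (T, D) array of observation vectors
    states:    length-T sequence of state indices (one per frame)
    H[i]:      (D, 1+S) regression matrix of state i; column 0 is the
               bias term, the remaining S columns multiply v
    Sigma_inv: list of (D, D) inverse covariances, one per state
    """
    S = H[0].shape[1] - 1
    A = np.zeros((S, S))   # accumulates Hv^T Sigma^-1 Hv over frames
    b = np.zeros(S)        # accumulates Hv^T Sigma^-1 (o - h0)
    for o, q in zip(obs, states):
        h0, Hv = H[q][:, 0], H[q][:, 1:]
        P = Sigma_inv[q]
        A += Hv.T @ P @ Hv
        b += Hv.T @ P @ (o - h0)
    # Normal equations of the weighted least-squares problem
    return np.linalg.solve(A, b)
```

In the full technique, the alignment is not known, so this closed-form update would sit inside an EM-style loop that re-estimates state and duration posteriors between updates of v.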
