Emotional Speech Synthesis with Corpus-Based Generation of F0 Contours Using Generation Process Model

A method was developed for corpus-based synthesis of emotional speech. Fundamental frequency (F0) contours are synthesized by predicting the command values of the generation process model with binary regression trees that take linguistic information of the sentence to be synthesized as input. Owing to the constraints of the model, the synthesized speech retains a certain quality even when the prediction is poor. Prediction of accent phrase boundaries for the input text, a process necessary for the synthesis, was also realized within a similar statistical framework. Segmental features were generated with the HMM-based synthesis scheme. The speech corpus used for the synthesis includes three types of emotional speech (anger, joy, sadness) and calm speech uttered by a female narrator. The command values of the model needed for training and testing were extracted automatically with a program developed by the authors. To improve prediction, accent phrases for which the automatic extraction performed poorly were excluded from the training corpus. The mismatches between predicted and target contours for angry speech were similar to those for calm speech, while larger mismatches were observed for sad and joyful speech. A perceptual experiment using the synthesized speech indicated that anger could be conveyed well by the developed method.
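For readers unfamiliar with the generation process model mentioned above (commonly known as the Fujisaki model), the following is a minimal sketch of how an F0 contour is assembled from a baseline value plus the responses to phrase and accent commands. The constants and command values are illustrative assumptions, not figures from the paper; in the described method the command values would instead be predicted by binary regression trees from linguistic features of each accent phrase.

```python
import numpy as np

# Typical constants for the generation process (Fujisaki) model; these are
# common default values in the literature, assumed here for illustration.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling level of the accent component

def phrase_component(t, onset, amplitude):
    # Impulse response of the phrase control mechanism to a phrase command.
    x = t - onset
    return amplitude * np.where(x >= 0.0, ALPHA**2 * x * np.exp(-ALPHA * x), 0.0)

def accent_component(t, t_on, t_off, amplitude):
    # Step response of the accent control mechanism, switched on at the onset
    # and off at the offset of the accent command.
    def step(x):
        return np.where(x >= 0.0,
                        np.minimum(1.0 - (1.0 + BETA * x) * np.exp(-BETA * x), GAMMA),
                        0.0)
    return amplitude * (step(t - t_on) - step(t - t_off))

def f0_contour(t, base_f0, phrase_commands, accent_commands):
    # ln F0(t) = ln Fb + sum of phrase components + sum of accent components.
    ln_f0 = np.full_like(t, np.log(base_f0), dtype=float)
    for onset, amp in phrase_commands:
        ln_f0 += phrase_component(t, onset, amp)
    for t_on, t_off, amp in accent_commands:
        ln_f0 += accent_component(t, t_on, t_off, amp)
    return np.exp(ln_f0)

# Illustrative command values (hand-picked, not predicted from a corpus).
t = np.linspace(0.0, 2.0, 400)
f0 = f0_contour(t, base_f0=120.0,
                phrase_commands=[(0.0, 0.4)],
                accent_commands=[(0.2, 0.6, 0.5), (0.9, 1.4, 0.3)])
```

Because any predicted command set is rendered through these smooth response functions, the resulting contour remains physiologically plausible, which is the model constraint the abstract credits for preserving quality under poor prediction.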
