Improvement in corpus-based generation of F0 contours using generation process model for emotional speech synthesis

A corpus-based method was developed for generating fundamental frequency (F0) contours in emotional speech synthesis. The method is based on the generation process model and predicts its command parameters (positions and amplitudes) using binary regression trees that take as input the linguistic information of the sentence to be synthesized. Because of the constraint imposed by the model, a certain level of quality is preserved in the synthesized speech even when the prediction is inaccurate. The speech corpus includes three types of emotional speech (anger, joy, sadness) and calm speech uttered by a female narrator. The command parameters required for training and testing the method were extracted automatically from speech using a program developed by the authors. Since the accuracy of this extraction strongly affects the prediction performance, a constraint is placed on the positions of the phrase commands during extraction. The method first predicts the phrase command parameters, which are then used to predict the accent command parameters. The mismatches between the predicted and target contours for angry speech were comparable to those for calm speech. Emotional speech was then synthesized from text input; the segmental features were generated by HMM-based synthesis, and the phoneme durations were predicted by a similar corpus-based method. A perceptual experiment conducted with the synthesized speech indicated that anger was conveyed well by the developed method, whereas the results for joy and sadness were poorer.
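For readers unfamiliar with the generation process model, the sketch below shows how an F0 contour is reconstructed from phrase and accent commands under its standard superposition formulation (the contour in the log-frequency domain is the sum of a base level, phrase-command responses, and accent-command responses). This is only an illustrative sketch, not the authors' implementation: the time constants, base frequency, and the example command values are assumptions chosen for demonstration, not parameters reported in the paper.

```python
import numpy as np

def phrase_response(t, alpha=3.0):
    """Response of the phrase control mechanism: Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0, else 0."""
    tp = np.maximum(t, 0.0)          # zero response before the command onset
    return alpha ** 2 * tp * np.exp(-alpha * tp)

def accent_response(t, beta=20.0, gamma=0.9):
    """Response of the accent control mechanism: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tp = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + beta * tp) * np.exp(-beta * tp), gamma)

def f0_contour(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0):
    """
    ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
                     + sum_j Aa_j * [Ga(t - T1_j) - Ga(t - T2_j)]
    phrase_cmds: list of (T0, Ap) pairs; accent_cmds: list of (T1, T2, Aa) triples.
    """
    log_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta) - accent_response(t - t2, beta))
    return np.exp(log_f0)

# Illustrative usage with made-up command values: one phrase command and two accent commands.
t = np.linspace(0.0, 2.0, 400)
f0 = f0_contour(t, fb=120.0,
                phrase_cmds=[(0.0, 0.4)],
                accent_cmds=[(0.2, 0.5, 0.3), (0.9, 1.3, 0.25)])
```

In this framing, the prediction task described in the abstract amounts to estimating the command timings (T0, T1, T2) and amplitudes (Ap, Aa) from linguistic features; the regression trees and the extraction constraint on phrase-command positions belong to the paper's method, not to this sketch.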