Contributions to the analysis, design and evaluation of strategies for corpus-based emotional speech synthesis
The work carried out in this Thesis has focused on improving the response generation module by incorporating emotional speech synthesis in Spanish. The Thesis is divided into three stages, each related to one of the defined scientific contributions. First, the relevance of each speech component for conveying emotion through the speech signal has been studied. The complementary behaviour of segmental and supra-segmental components has been demonstrated by analysing their relevance for each of the studied emotions. Using an existing corpus, the nature of the emotions has been studied through automatic identification strategies and a perceptual evaluation of emotional stimuli generated by copy-synthesis. In addition, speaker-independent modelling of emotional acoustic patterns has been studied by implementing and evaluating a multi-speaker, multi-language automatic emotion identification system. Furthermore, the performance of a system for the automatic identification of real emotions (based on dynamic Bayesian networks) has been evaluated in the first international emotion recognition challenge. Second, the conclusions of this analysis have been the basis for acquiring an emotional corpus in Spanish that is novel in its multimedia and multi-speaker content. This corpus has been essential for adapting two state-of-the-art, high-quality speech synthesis techniques to emotional speech and for evaluating them exhaustively: unit selection synthesis, the dominant technique of the last decade, and HMM-based synthesis, an emerging technique and the basis of future research in this area for the next decade. An exhaustive and novel analysis of the results of a perceptual evaluation has shown that both techniques synthesise emotional speech with the same quality.
Although emotions are best identified when synthesised with the unit selection technique, which also yields the highest emotional strength, HMM-based synthesis is the technique that best models prosodic information, which is extremely important in expressive speech. The HMM-based system adapted to Spanish was recognised as the best system in the text-to-speech challenge at the Jornadas de Tecnologia del Habla in 2008. Finally, a new strategy for the emotional, speaker-independent transformation of synthetic speech has been designed, implemented and evaluated using the emotional voices generated with one of the previous techniques (specifically, the voices generated with the HMM-based technique, chosen for the flexibility and controllability of its speech model parameters and for the excellent results obtained in the challenge). This new strategy consists of extrapolating the emotions through the relevant speech components identified in the initial analysis. The results of the perceptual evaluation confirm that the emotional acoustic patterns have been partially extrapolated to the neutral voice of a target speaker without extrapolating the identity of the source speaker. Additionally, the strength of the extrapolation can be successfully modified by means of an extrapolation factor. However, increasing the strength of the extrapolation has a negative impact on the quality of the synthesised speech, especially when the emotion extrapolation focuses on transforming the spectral parameters. Finally, a new metric for evaluating the proposed strategy has been defined, based on the speech quality, emotion identification and speaker identification results.