MODELLING THE GLOBAL ACOUSTIC CORRELATES OF EXPRESSIVITY FOR CHINESE TEXT-TO-SPEECH SYNTHESIS

This paper proposes a novel approach to describing the expressive elements of dialog response messages for expressive text-to-speech synthesis. We adopt the three-dimensional PAD (pleasure, arousal, dominance) emotional model to describe expressivity based on the content of a response message and its dialog state. In particular, we use the P (pleasure) and A (arousal) descriptors to describe expressivity at the local, prosodic-word level based on each word's semantics, and the D (dominance) descriptor to describe expressivity at the global, utterance level based on the utterance's dialog act. Our study is based on the response messages of a spoken dialog system in the Hong Kong tourism domain. We prepared contrastive (neutral versus expressive) recordings to help identify the acoustic correlates of expressivity at both the local and global levels. From the acoustic analysis of these contrastive recordings, we establish a nonlinear model that modulates input neutral speech at both levels to generate output expressive speech. This work focuses on the nonlinear relationship between the D (dominance) values and their acoustic correlates. Perceptual evaluation indicates that local modulation of the input neutral speech alone yields appropriate expressivity in over 73% of utterances, while combining local and global modulation raises this to nearly 84%.