This paper proposes a novel approach to describing the expressive elements of dialog response messages for expressive text-to-speech synthesis. We adopt the three-dimensional PAD (pleasure, arousal, dominance) emotional model to describe expressivity based on the content of a response message and its dialog state. In particular, we use the P (pleasure) and A (arousal) descriptors to describe expressivity at the local, prosodic-word level based on the semantics of each prosodic word, and the D (dominance) descriptor to describe expressivity at the global, utterance level based on the dialog act of the utterance. Our study uses the response messages of a spoken dialog system in the Hong Kong tourism domain. We prepared contrastive (neutral versus expressive) recordings to help identify the acoustic correlates of expressivity at both local and global levels. From the acoustic analysis of these contrastive recordings, we established a nonlinear model that modulates input neutral speech at both local and global levels to generate output expressive speech. This work focuses on the nonlinear relationship between the D (dominance) values and their acoustic correlates. Perceptual evaluation indicates that local modulation of input neutral speech yields appropriate expressivity in over 73% of utterances, and that combining local and global modulation raises this to nearly 84%.