The intelligibility and naturalness of synthetic speech depend strongly on its prosodic quality. Building on work by Mixdorff on a linguistically motivated model of German intonation based on the Fujisaki model, this paper presents statistical results on the relationship between the linguistic and phonetic information underlying an utterance and its prosodic features. The statistical analysis yields, among others, the following pairings of strongest single factor → prosodic feature: boundary depth (right) → syllable duration; boundary depth (left) → phrase command magnitude Ap; accent type (intoneme) → accent command amplitude Aa. These results were used to train an integrated prosodic model, based on a feed-forward neural network (FFNN), that predicts syllable durations together with syllable-aligned Fujisaki control parameters. Correlations between the training targets and the predicted parameters suggest synergy effects: for some parameters they are higher than the correlations obtained when each parameter is predicted individually from the same set of input features using a regression model. Informal listening tests with the first resynthesis examples showed encouraging results.
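For readers unfamiliar with the Fujisaki control parameters named above (phrase command magnitude Ap, accent command amplitude Aa), the following is a minimal sketch of how such commands are commonly turned into an F0 contour. It follows the widely cited formulation of the Fujisaki model, not necessarily the exact variant used in this paper. The function names, the example command values, and the constants `ALPHA`, `BETA`, and `GAMMA` are assumptions chosen for illustration.

```python
import math

# Assumed typical constants; the paper's abstract does not state them.
ALPHA = 2.0   # time constant of the phrase control mechanism (1/s)
BETA = 20.0   # time constant of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    fb          -- base frequency in Hz
    phrase_cmds -- list of (onset_time, Ap) tuples
    accent_cmds -- list of (onset_time, offset_time, Aa) tuples
    """
    # Components are superimposed in the log-frequency domain.
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)


# Hypothetical example: one phrase command and one accent command.
contour = [
    fujisaki_f0(0.01 * i, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
    for i in range(150)
]
```

In an integrated model of the kind described, the predicted syllable-aligned values of Ap, Aa, and the command timings would take the place of the hand-picked tuples in this example.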
[1] Dieter Mehnert, et al. Exploring the naturalness of several German high-quality-text-to-speech systems. EUROSPEECH, 1999.
[2] Hansjörg Mixdorff, et al. Learning the parameters of quantitative prosody models. INTERSPEECH, 2000.
[3] Petra Wagner, et al. Synthesis by word concatenation. EUROSPEECH, 1999.
[4] Hansjörg Mixdorff, et al. A novel approach to the fully automatic extraction of Fujisaki model parameters. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2000.
[5] Richard Shillcock, et al. Proceedings of EUROSPEECH-1991. 1991.
[6] Keikichi Hirose, et al. Analysis of voice fundamental frequency contours for declarative sentences of Japanese. 1984.
[7] Stefan Rapp, et al. Automatisierte Erstellung von Korpora für die Prosodieforschung. 1998.