Building an integrated prosodic model of German

The intelligibility and naturalness of synthetic speech depend strongly on its prosodic quality. Building on work by Mixdorff on a linguistically motivated model of German intonation based on the Fujisaki model, this paper presents statistical results on the relationship between the linguistic and phonetic information underlying an utterance and its prosodic features. Among other findings, the statistical analysis yields the following pairs of strongest single factor → prosodic feature: boundary depth (right) → syllable duration; boundary depth (left) → phrase command magnitude Ap; accent type (intoneme) → accent command amplitude Aa. These results were used to train an integrated prosodic model, based on a feed-forward neural network (FFNN), that predicts syllable durations together with syllable-aligned Fujisaki control parameters. The correlations between target and predicted parameters suggest synergy effects: for some parameters they are higher than the correlations obtained when each parameter is predicted individually from the same set of input features with a regression model. Informal listening tests with initial resynthesis examples gave encouraging results.
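For readers unfamiliar with the Fujisaki model, the parameters predicted here (Ap, Aa) control a log-F0 contour built from a base frequency Fb plus phrase components and accent components. Phrase components are responses to impulse-like phrase commands of magnitude Ap at times T0. Accent components are responses to box-shaped accent commands of amplitude Aa spanning onset T1 to offset T2. The following is a minimal sketch of the standard textbook formulation, showing how a set of such command parameters could be turned into an F0 contour. The function names and the default constants α = 3/s, β = 20/s and γ = 0.9 (values commonly used in the literature) are assumptions for illustration and are not taken from this paper.

```python
import math


def phrase_component(t, alpha=3.0):
    """Phrase control response Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0.

    alpha (1/s) is an assumed default, not a value from this paper.
    """
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control response Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0.

    beta (1/s) and the ceiling gamma are assumed defaults.
    """
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (s) under the Fujisaki model.

    fb          -- base frequency Fb in Hz
    phrase_cmds -- list of (T0, Ap) tuples: command onset time and magnitude
    accent_cmds -- list of (T1, T2, Aa) tuples: onset, offset and amplitude
    """
    ln_f0 = math.log(fb)
    # Superimpose all phrase components on the log base frequency.
    for t0, ap in phrase_cmds:
        ln_f0 += ap * phrase_component(t - t0, alpha)
    # Each accent command contributes the difference of two step responses,
    # one at its onset T1 and one at its offset T2.
    for t1, t2, aa in accent_cmds:
        ln_f0 += aa * (accent_component(t - t1, beta, gamma)
                       - accent_component(t - t2, beta, gamma))
    return math.exp(ln_f0)
```

As a usage example, `[fujisaki_f0(i * 0.01, 80.0, [(-0.2, 0.5)], [(0.3, 0.6, 0.4)]) for i in range(150)]` samples a 1.5 s contour at 10 ms intervals for one hypothetical phrase command and one hypothetical accent command. Working in the log domain mirrors the model's formulation, in which components add on a logarithmic F0 scale.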