Exploring the naturalness of several German high-quality-text-to-speech systems

The synthesis of near-to-natural F0 contours is an important issue in text-to-speech and crucial to the naturalness and intelligibility of synthetic speech. In earlier studies of the first author a model of German intonation was developed that is based on the quantitative Fujisaki-model. The current paper addresses a perception experiment comparing a TTS-system incorporating this new approach with several German TTS-systems with high segmental quality. Natural speech samples and a synthesis version with natural segment durations were used as references. Results show, that the natural speech samples unanimously received 10 points on a 0 to 10 point scale. The best TTS-systems cluster around a mean value of 5.0, whereas the variant with natural durations reached a mean score of 6.6 points, indicating the importance of closely modeling natural segment durations.