Data-Driven Synthesis of Fundamental Frequency Contours for TTS Systems Based on a Generation Process Model

A data-driven method of fundamental frequency (F0) contour synthesis was developed for Japanese text-to-speech (TTS) conversion systems. In this method, F0 contours are synthesized with the F0 contour generation process model, and the model parameters for each accent phrase are estimated by statistical methods. Although the synthesized F0 contours had already been shown to sound as natural as those generated with heuristic rules arranged by experts, quality was occasionally low, depending on the sentence to be synthesized. In this paper, information on sentence structure, which can be obtained automatically through parsing, is added to the input parameters of the statistical methods to improve the estimation. Experimental results showed that the new parameter was effective, especially for improving phrase component estimation. Furthermore, data-driven estimation of accent phrase boundaries for input text, a necessary step in TTS conversion, was realized in a similar way. The rate of correct estimation reached 90%.
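As an illustration of how a generation process model turns a small set of parameters into an F0 contour, the following minimal sketch assumes the commonly used formulation in which log F0 is the sum of a baseline value, phrase components (impulse responses of a critically damped second-order system) and accent components (capped step responses of a similar system). The function names, the command tuples, and the constants alpha = 3.0, beta = 20.0 and gamma = 0.9 are illustrative assumptions and are not taken from this paper; in the method described above, the command magnitudes and timings for each accent phrase would be supplied by the statistical estimation.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds,
               alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at each time (in seconds) listed in `times`.

    fb          : baseline frequency in Hz (assumed value)
    phrase_cmds : list of (onset_time, magnitude) tuples (hypothetical layout)
    accent_cmds : list of (onset_time, offset_time, amplitude) tuples
    """
    contour = []
    for t in times:
        # The components are summed in the log-frequency domain.
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0, alpha)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour


# Illustrative input: one phrase command and one accent command.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(times, 120.0, [(0.0, 0.5)], [(0.3, 0.8, 0.4)])
```

Because the contour is fully determined by the baseline value and a few commands per phrase, estimating those command parameters from linguistic input is sufficient to generate the whole contour.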