A simple and effective pitch re-estimation method for rich prosody and speaking styles in HMM-based speech synthesis

This paper proposes a controllable pitch re-estimation method that can produce better pitch contours or provide diverse speaking styles for text-to-speech (TTS) systems. The method consists of a pitch re-estimation model and a set of control parameters. The pitch re-estimation model is employed to reduce the over-smoothing effects that are usually introduced by TTS training. The control parameters are designed to generate not only rich intonation but also distinct speaking styles, e.g. a foreign accent or an excited tone. To verify the feasibility of the proposed method, we conducted experiments with both objective measures and subjective listening tests. Although the re-estimated pitch yields only slightly lower prediction error on the objective measures, it produces clearly better intonation in the listening tests. Moreover, expressive speech can be generated successfully within the framework of controllable pitch re-estimation.