Speech-rate-variable HMM-based Japanese TTS system

This paper proposes a new method for controlling phoneme duration according to an arbitrary target speech rate in text-to-speech (TTS) synthesis systems. The proposed method first constructs three fundamental duration models, at "fast", "normal", and "slow" speech rates, using Hayashi's quantification theory (type 1) on real speech databases, and then creates a duration model for a target speech rate by interpolating the fundamental models. Our TTS system uses an HMM-based synthesizer, which allows flexible prosody control. Speech synthesized by the proposed method is evaluated in subjective experiments at four speech rates, using paired comparison tests between the proposed method and a rule-based method. The results show that the proposed method achieves higher naturalness in synthesized speech than the rule-based method.
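
The interpolation of fundamental duration models can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each fundamental model has the additive form used by quantification theory type 1 (a mean duration plus one score per factor category), that each model is anchored at a scalar speech rate, and that interpolation is piecewise linear between neighbouring anchors. The abstract does not specify the interpolation scheme, the factor set, or any numeric values, so the names `predict` and `interpolate_duration`, the factors `phoneme` and `position`, the anchor rates, and all durations (taken to be in milliseconds) are hypothetical.

```python
from bisect import bisect_right


def predict(model, context):
    """Additive prediction: mean duration plus one score per factor category."""
    return model["mean"] + sum(
        model["scores"][factor][category] for factor, category in context.items()
    )


def interpolate_duration(models, rate, context):
    """Piecewise-linear interpolation between neighbouring fundamental models.

    models: list of (anchor_rate, model) pairs sorted by anchor_rate.
    Rates outside the anchor range are extrapolated from the nearest segment.
    """
    anchors = [anchor for anchor, _ in models]
    # Index of the segment containing `rate`, clamped to the valid segments.
    i = min(max(bisect_right(anchors, rate) - 1, 0), len(models) - 2)
    (r0, m0), (r1, m1) = models[i], models[i + 1]
    w = (rate - r0) / (r1 - r0)
    return (1 - w) * predict(m0, context) + w * predict(m1, context)


# Toy fundamental models with made-up values (assumed units: ms).
SLOW = {"mean": 120.0, "scores": {"phoneme": {"a": 10.0, "k": -20.0},
                                  "position": {"initial": 5.0, "final": 15.0}}}
NORMAL = {"mean": 90.0, "scores": {"phoneme": {"a": 8.0, "k": -15.0},
                                   "position": {"initial": 3.0, "final": 10.0}}}
FAST = {"mean": 60.0, "scores": {"phoneme": {"a": 5.0, "k": -10.0},
                                 "position": {"initial": 2.0, "final": 6.0}}}

# Hypothetical anchor rates (e.g. morae per second) for the three models.
MODELS = [(6.0, SLOW), (8.0, NORMAL), (10.0, FAST)]

context = {"phoneme": "a", "position": "final"}
duration_ms = interpolate_duration(MODELS, 7.0, context)  # halfway slow/normal
```

Because each model in this sketch is linear in its parameters, interpolating the predicted durations gives the same result as interpolating the means and category scores and then predicting once.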