In HMM-based TTS, statistical models of static, velocity (delta), and acceleration (delta-delta) parameters are jointly trained in a unified, ML-based framework. A previous study has shown that the acceleration parameters help generate smoother trajectories with less distortion, but the effect has never been investigated in formal objective and subjective tests. In this paper, the effect of the acceleration parameters, in addition to their static and velocity counterparts, on trajectory generation is studied in depth. We show that discarding the acceleration parameters introduces only a small additional distortion compared to the reference generated with the full model parameters. However, human subjects can easily perceive the degradation in voice quality, because saw-tooth-like trajectories are commonly generated. Several methods to alleviate the discontinuity are discussed, and we choose the upper- and lower-bounded envelopes of the saw-tooth trajectories for further analysis. Experimental results show that both envelope trajectories have larger objective distortions than the saw-tooth ones. Nevertheless, the speech synthesized using the envelope trajectory becomes perceptually transparent to the reference. This study, in addition to its subjective and objective significance in measuring the distortion of the synthesized speech, facilitates efficient implementation of low-cost TTS systems, as well as low-bit-rate speech coding and reconstruction.
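As a rough illustration of the generation step discussed above, the following is a minimal sketch of ML trajectory generation for a single one-dimensional feature stream. It assumes a fixed state sequence, diagonal covariances, and three-frame windows. The window coefficients and the names `WINDOWS`, `build_window_matrix`, and `generate_trajectory` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Assumed three-frame window coefficients (a common choice, not from the paper).
WINDOWS = {
    "static": np.array([0.0, 1.0, 0.0]),
    "delta": np.array([-0.5, 0.0, 0.5]),
    "delta2": np.array([1.0, -2.0, 1.0]),
}


def build_window_matrix(num_frames, names):
    """Stack the selected windows into W so that o = W @ c.

    Rows are ordered frame by frame, and within each frame in the
    order given by `names`. Frame indices are clamped at the edges.
    """
    rows = []
    for t in range(num_frames):
        for name in names:
            row = np.zeros(num_frames)
            for offset, coeff in zip((-1, 0, 1), WINDOWS[name]):
                tau = min(max(t + offset, 0), num_frames - 1)
                row[tau] += coeff
            rows.append(row)
    return np.array(rows)


def generate_trajectory(means, variances, names):
    """Solve (W^T P W) c = W^T P mu for the static trajectory c.

    means, variances: arrays of shape (num_frames, len(names)) holding
    the per-frame Gaussian statistics of each selected parameter type.
    `names` must include "static" so the system is well conditioned.
    """
    num_frames = means.shape[0]
    W = build_window_matrix(num_frames, names)
    mu = means.reshape(-1)                    # same row order as W
    precision = 1.0 / variances.reshape(-1)   # diagonal of P
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)
```

Under these assumptions, calling `generate_trajectory` with `names=["static", "delta"]` instead of `["static", "delta", "delta2"]` corresponds to discarding the acceleration parameters, so the two outputs can be compared frame by frame.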