An evaluation of automatic phone segmentation for concatenative speech synthesis

This paper studies the performance of automatic phone segmentation from two viewpoints: temporal precision and the effect on the naturalness of synthetic speech. The absolute errors of the phone onset time for the best 90% and the worst 10% were 4.6 ms and 25.9 ms, respectively. These values are comparable to the discrepancies among human labelers. In perception tests that compared, in pairs, the naturalness of synthetic speech generated from hand-segmented data with that generated from auto-segmented data, the latter was found to be statistically inferior.
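As an illustration of the temporal-precision measure, the following minimal sketch computes absolute phone-onset errors between a hand-labeled and an automatically labeled boundary list, then summarizes the best 90% and the worst 10% separately. It assumes the two figures in the abstract are means over each group, which the abstract does not state. The function name `split_onset_errors` and its parameters are hypothetical and are not taken from the paper.

```python
def split_onset_errors(hand_onsets_ms, auto_onsets_ms, best_fraction=0.9):
    """Return (mean error of best group, mean error of worst group) in ms.

    Assumption: both lists hold onset times, in milliseconds, for the same
    phones in the same order. Using the mean as the summary statistic is
    also an assumption; the abstract does not say which statistic was used.
    """
    if len(hand_onsets_ms) != len(auto_onsets_ms):
        raise ValueError("onset lists must have the same length")

    # Absolute onset error per phone, sorted from smallest to largest.
    errors = sorted(abs(a - h) for h, a in zip(hand_onsets_ms, auto_onsets_ms))

    # Index separating the best `best_fraction` of errors from the rest.
    cut = int(round(len(errors) * best_fraction))
    best, worst = errors[:cut], errors[cut:]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(best), mean(worst)
```

Reporting the two groups separately keeps a small number of gross boundary errors from being hidden by the many small ones, which a single overall average would do.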
