Automatically clustering similar units for unit selection in speech synthesis

This paper describes a new method for synthesizing speech by concatenating sub-word units from a database of labelled speech. A large unit inventory is created by automatically clustering units of the same phone class based on their phonetic and prosodic context. The appropriate cluster is then selected for a target unit offering a small set of candidate units. An optimal path is found through the candidate units based on their distance from the cluster center and an acoustically based join cost. Details of the method and justification are presented. The results of experiments using two different databases are given, optimising various parameters within the system. Also a comparison with other existing selection based synthesis techniques is given showing the advantages this method has over existing ones. The method is implemented within a full text-to-speech system offering efficient natural sounding speech synthesis.

[1]  Alex Acero,et al.  Recent improvements on Microsoft's trainable text-to-speech system-Whistler , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[3]  Yoshinori Sagisaka,et al.  ATR μ-talk speech synthesis system , 1992, ICSLP.

[4]  Philip C. Woodland,et al.  Improvements in an HMM-based speech synthesiser , 1995, EUROSPEECH.

[5]  Stephen Isard,et al.  Optimal coupling of diphones , 1994, SSW.

[6]  Alan W. Black,et al.  Prosody and the Selection of Source Units for Concatenative Synthesis , 1997 .