Definition of a training set for unit selection-based speech synthesis

Defining the cost terms in unit-selection-based synthesis is a difficult task. Usually, cost terms are based on the developers' general phonetic knowledge and on subsequent perceptual experiments. A training set of the kind used for supervised learning, well known from pattern recognition, could be a useful way to arrive at a more formal analysis of the different factors influencing the selection of units. As a first step toward this aim, we present an objective distance measure that is used to sort the units contained in the corpus in relation to a given natural unit, and we demonstrate its relevance to human perception. To keep listeners from paying too much attention to discontinuities caused by concatenation, we also present a waveform-based smoothing algorithm. Experiments show that the sorting criterion and human perception match in most cases. Furthermore, we observe that similarity between natural and synthetic speech is higher when phoneme-based units are used, whereas naturalness increases with the concatenation of larger units.