Prosody and the Selection of Source Units for Concatenative Synthesis

This chapter describes a procedure for processing a large speech corpus to provide a reduced set of units for concatenative synthesis. Crucial to this reduction is the optimal use of prosodic labeling to limit acoustic distortion in the resulting speech waveform. We present a method for selecting units for synthesis by optimizing a weighting between continuity distortion and unit distortion. The source-unit set is determined statistically from a speech corpus by representing it as the set of sound sequences that occur with equal frequency, i.e., by recursively grouping pairs of segment labels to grow nonuniform-length compound label-strings. Storing multiple units with different prosodic characteristics then ensures that the reduced database will be maximally representative of the natural variation in the original speech. The choice of an appropriate depth to which to prune the database reflects a trade-off between compact size and output voice quality: a larger database is more likely to contain a prosodically appropriate segment that will need less modification to reach a target setting in the concatenated utterance.
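
The recursive grouping of segment-label pairs can be illustrated with a minimal sketch. It is not the chapter's exact procedure: it assumes a greedy rule that fuses the most frequent adjacent pair on each pass and stops when no pair occurs at least `min_count` times, which only approximates the equal-frequency criterion. The function name `grow_compound_labels`, the `min_count` threshold, and the `+` joiner are hypothetical choices made for illustration.

```python
from collections import Counter


def grow_compound_labels(utterances, min_count=2, joiner="+"):
    """Greedily fuse the most frequent adjacent label pair until none is frequent.

    utterances: list of label sequences (e.g. phone labels per utterance).
    min_count:  assumed stopping threshold; pairs seen fewer times stay unmerged.
    Returns the relabeled sequences, whose labels are nonuniform-length
    compound strings such as "s+t+a".
    """
    seqs = [list(u) for u in utterances]
    while True:
        # Count every adjacent pair of (possibly already compound) labels.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        merged = left + joiner + right
        # Rewrite each sequence, replacing the chosen pair left to right.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs


if __name__ == "__main__":
    corpus = [["s", "t", "a"], ["s", "t", "o"], ["s", "t", "a"]]
    for seq in grow_compound_labels(corpus):
        print(seq)
```

In this toy corpus the frequent pair `s t` is fused first, and the resulting compound then combines with `a`, so frequent sequences grow into longer units while rare ones stay short. Raising `min_count` in this sketch corresponds loosely to pruning the database more aggressively.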