Paper Information - Prosody and the Selection of Source Units for Concatenative Synthesis

This chapter describes a procedure for processing a large speech corpus to provide a reduced set of units for concatenative synthesis. Crucial to this reduction is the optimal utilization of prosodic labeling to reduce acoustic distortion in the resulting speech waveform. We present a method for selecting units for synthesis by optimizing a weighting between continuity distortion and unit distortion. The source-unit set is determined statistically from a speech corpus by representing it as the set of sound sequences that occur with equal frequency, i.e., by recursively grouping pairs of segment labels to grow nonuniform-length compound label-strings. Storing multiple units with different prosodic characteristics then ensures that the reduced database will be maximally representative of the natural variation in the original speech. The choice of an appropriate depth to which to prune the database reflects a trade-off between compact size and output voice quality; a larger database is more likely to contain a prosodically appropriate segment that will need less modification to reach a target setting in the concatenated utterance.
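The recursive grouping of label pairs described in the abstract can be illustrated with a small sketch. The abstract does not give the chapter's actual merge statistic or stopping rule, so everything specific here is an assumption made for illustration: the greedy "merge the most frequent adjacent pair" rule, the `max_count` threshold used to stop once pair frequencies have roughly evened out, the `+` separator, and the function names `merge_pass` and `grow_units`.

```python
from collections import Counter


def merge_pass(seq, pair, sep="+"):
    """Replace each non-overlapping occurrence of `pair` in `seq`
    with a single compound label (hypothetical helper)."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + sep + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def grow_units(utterances, max_count):
    """Greedily merge the most frequent adjacent label pair until no
    pair occurs more than `max_count` times (assumed stopping rule)."""
    utterances = [list(u) for u in utterances]
    while True:
        pairs = Counter()
        for u in utterances:
            pairs.update(zip(u, u[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        pair, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count <= max_count:
            break
        utterances = [merge_pass(u, pair) for u in utterances]
    return utterances


corpus = [["k", "a", "t"], ["k", "a", "p"], ["b", "a", "t"]]
units = grow_units(corpus, max_count=1)
```

Frequent sequences grow into longer compound labels while rare ones stay as single segments, which yields the nonuniform unit lengths the abstract mentions. Splitting each compound label on the separator recovers the original segment labels, so no labeling information is lost by the grouping.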

Alan W. Black | Nick Campbell
