论文信息 - Unit selection in a concatenative speech synthesis system using a large speech database

Unit selection in a concatenative speech synthesis system using a large speech database

One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.

Alan W. Black | Andrew J. Hunt | Andrew J. Hunt | A. Black | A. Hunt

[1] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[2] Eric Moulines,et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[3] Yoshinori Sagisaka,et al. Concatenative speech synthesis by minimum distortion criteria , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Nick Campbell,et al. Optimising selection of units from speech databases for concatenative synthesis , 1995, EUROSPEECH.

[5] Alan W. Black,et al. Prosody and the Selection of Source Units for Concatenative Synthesis , 1997 .