One approach to generating natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence, predicted from text, that is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be treated as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of the concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search selects the best units for synthesis from the database. Because the costs are weighted distances, the weights can be trained from natural speech: two training methods are presented, and both yield weights that produce more natural speech than can be obtained by hand-tuning.
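The search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` and `Target` classes, the single pitch feature, the one-dimensional "spectrum" values at unit boundaries, the weights `w_target` and `w_join`, and the beam-style pruning are all assumptions chosen to keep the example small. The state occupancy cost is a weighted distance between a candidate unit and its target, the transition cost is a weighted mismatch at the join, and a Viterbi pass keeps only the cheapest `beam_width` states at each step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Target:
    """Hypothetical target specification for one phoneme."""
    phoneme: str
    pitch: float  # assumed prosodic feature


@dataclass(frozen=True)
class Unit:
    """Hypothetical database unit: one phoneme instance with toy features."""
    uid: int
    phoneme: str
    pitch: float
    start_spec: float  # toy 1-D stand-in for the spectrum at the unit start
    end_spec: float    # toy 1-D stand-in for the spectrum at the unit end


def target_cost(target: Target, unit: Unit, w_target: float = 1.0) -> float:
    """State occupancy cost: weighted distance between unit and target."""
    return w_target * abs(target.pitch - unit.pitch)


def join_cost(prev: Unit, nxt: Unit, w_join: float = 1.0) -> float:
    """Transition cost: weighted mismatch at the concatenation point."""
    return w_join * abs(prev.end_spec - nxt.start_spec)


def select_units(
    targets: List[Target], database: List[Unit], beam_width: int = 3
) -> Tuple[float, List[Unit]]:
    """Pruned Viterbi search over candidate units; returns (cost, path)."""
    paths: List[Tuple[float, List[Unit]]] = [(0.0, [])]
    for target in targets:
        candidates = [u for u in database if u.phoneme == target.phoneme]
        if not candidates:
            raise ValueError(f"no unit for phoneme {target.phoneme!r}")
        new_paths = []
        for unit in candidates:
            occupancy = target_cost(target, unit)
            # Viterbi step: keep only the cheapest predecessor for this unit.
            best_cost, best_path = min(
                (
                    (cost + occupancy + (join_cost(path[-1], unit) if path else 0.0), path)
                    for cost, path in paths
                ),
                key=lambda cp: cp[0],
            )
            new_paths.append((best_cost, best_path + [unit]))
        # Pruning: retain only the beam_width cheapest states at this step.
        new_paths.sort(key=lambda cp: cp[0])
        paths = new_paths[:beam_width]
    return paths[0]


if __name__ == "__main__":
    database = [
        Unit(0, "a", 100.0, 1.0, 1.0),
        Unit(1, "a", 120.0, 1.0, 5.0),
        Unit(2, "b", 110.0, 5.0, 2.0),
        Unit(3, "b", 110.0, 1.0, 2.0),
    ]
    targets = [Target("a", 100.0), Target("b", 110.0)]
    cost, path = select_units(targets, database)
    print(cost, [u.uid for u in path])
```

In this toy database both "b" units match the target pitch equally well, so the choice between them is decided entirely by the join cost with the preceding "a" unit. Training, in the sense of the abstract, would correspond to fitting `w_target` and `w_join` (and their per-feature counterparts in a richer feature set) from natural speech rather than setting them by hand.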