This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech. This poor naturalness is caused by excessive prosodic modification of units drawn from a small speech database. To address this problem, we use a dynamic unit selection method that draws on a large speech database and applies no prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set that maximizes phonetic or prosodic coverage of Korean triphones. All utterances were segmented automatically into phonemes using a speech recognizer. Given this segmentation, the synthesis unit cost is zero whenever two synthesis units were placed consecutively in an utterance. This reduces the number of concatenation points at which mismatches may occur. We also present data on how major prosodic variations are realized by taking prosodic phrase break strength into account. Phrase breaks were divided into four levels of strength according to pause length, and triphones were further classified by phrase break strength to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like Part-of-Speech (POS) sequence model, which achieved 73.5% accuracy on 4-level break strength prediction. For unit selection, a Viterbi beam search finds the most appropriate triphone sequence, that is, the sequence with the minimum continuation cost of prosody and spectrum at the concatenation boundaries. In an informal listening test, the proposed Korean corpus-based TTS system showed better naturalness than a conventional demisyllable-based system.
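The unit selection step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the `Unit` fields, the scalar `pitch` and `spectrum` features, the absolute-difference cost, the weights, and the `beam_width` value are all assumptions made for illustration. The sketch only shows the two ideas stated in the abstract: the cost is zero when two units were consecutive in the same recorded utterance, and a Viterbi beam search keeps the lowest-cost partial sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One candidate triphone instance from the corpus (hypothetical layout)."""
    triphone: str    # triphone label
    utt_id: int      # utterance the unit was cut from
    index: int       # position of the unit inside that utterance
    pitch: float     # assumed scalar prosodic feature at the boundary
    spectrum: float  # assumed scalar stand-in for a spectral vector


def continuation_cost(prev, cur, w_pitch=1.0, w_spec=1.0):
    """Cost of joining prev to cur; zero if they were adjacent in the corpus."""
    if prev.utt_id == cur.utt_id and cur.index == prev.index + 1:
        return 0.0
    # Assumed mismatch measure: weighted absolute feature differences.
    return (w_pitch * abs(prev.pitch - cur.pitch)
            + w_spec * abs(prev.spectrum - cur.spectrum))


def select_units(targets, inventory, beam_width=3):
    """Viterbi beam search over candidate units for each target triphone.

    Returns (total_cost, path) for the cheapest surviving sequence.
    """
    for tri in targets:
        if not inventory.get(tri):
            raise ValueError(f"no candidates for triphone {tri!r}")
    # Each beam entry is (accumulated cost, list of chosen units).
    beam = [(0.0, [u]) for u in inventory[targets[0]]][:beam_width]
    for tri in targets[1:]:
        best = {}  # best (cost, path) ending in each candidate unit
        for cur in inventory[tri]:
            for cost, path in beam:
                total = cost + continuation_cost(path[-1], cur)
                if cur not in best or total < best[cur][0]:
                    best[cur] = (total, path + [cur])
        # Prune to the beam_width cheapest partial sequences.
        beam = sorted(best.values(), key=lambda entry: entry[0])[:beam_width]
    return min(beam, key=lambda entry: entry[0])


# Toy inventory: utterance 1 contains "a", "b", "c" in consecutive positions.
demo_inventory = {
    "a": [Unit("a", 1, 0, 100.0, 1.0), Unit("a", 2, 5, 120.0, 2.0)],
    "b": [Unit("b", 1, 1, 180.0, 3.0), Unit("b", 3, 0, 101.0, 1.0)],
    "c": [Unit("c", 1, 2, 90.0, 0.5), Unit("c", 3, 1, 150.0, 2.0)],
}
demo_cost, demo_path = select_units(["a", "b", "c"], demo_inventory)
```

In this toy inventory the search picks the three units from utterance 1, because their joins cost nothing even though their feature values differ. A real system would compare feature vectors at the left and right unit boundaries rather than one scalar per unit.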