Paper information - High-quality speech synthesis using context-dependent syllabic units

High-quality speech synthesis using context-dependent syllabic units

We propose a new method for constructing a context-dependent Japanese syllabic unit inventory, which effectively incorporates the spectral influence of left- and right-hand neighboring phonemes on CV (consonant and vowel) syllabic units. The syllabic unit inventory is generated by using a set of phonemic clusters, which we define, to approximate the average spectral behavior of phonemes in triphone contexts. Unit inventories of multiple sizes are generated for a waveform-concatenation-based text-to-speech system. The sizes of the inventories range from 322 to 3892 units, according to the fineness of the phonemic cluster set used. The synthetic speech generated by using the largest one is highly natural and intelligible. We also discuss the scalability of the proposed unit inventories.
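The idea of keying CV units by clustered neighboring phonemes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the cluster names, the phoneme-to-cluster mapping, the toy word list, and the helper functions `syllable_contexts`, `unit_key`, and `build_inventory` are hypothetical and are not taken from the paper, which defines its own phonemic cluster sets. The sketch only shows why a coarser cluster set yields a smaller inventory than a finer one.

```python
# Hypothetical coarse phonemic clusters (NOT the paper's cluster set).
# "#" stands for silence at a word boundary.
COARSE_CLUSTERS = {
    "a": "V_open", "i": "V_front", "e": "V_front", "o": "V_back", "u": "V_back",
    "k": "C_velar", "g": "C_velar",
    "t": "C_alveolar", "d": "C_alveolar", "s": "C_alveolar",
    "n": "C_nasal", "m": "C_nasal",
    "#": "SIL",
}

# The finest possible cluster set: every phoneme is its own cluster,
# which corresponds to plain triphone-style contexts.
FINE_CLUSTERS = {p: p for p in COARSE_CLUSTERS}


def syllable_contexts(syllables):
    """Yield (left phoneme, CV syllable, right phoneme) for a list of CV syllables."""
    padded = ["#"] + list(syllables) + ["#"]
    for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
        # Left neighbor is the last phoneme of the previous syllable,
        # right neighbor is the first phoneme of the next syllable.
        yield prev[-1], cur, nxt[0]


def unit_key(left, syllable, right, clusters):
    """Map a syllable in its phoneme context to a context-dependent unit key."""
    return (clusters[left], syllable, clusters[right])


def build_inventory(words, clusters):
    """Collect the set of distinct context-dependent units needed for the words."""
    return {
        unit_key(left, syl, right, clusters)
        for word in words
        for left, syl, right in syllable_contexts(word)
    }


# Toy corpus of CV syllable sequences (illustrative only).
WORDS = [
    ["sa", "ka", "na"],
    ["ta", "ka", "sa"],
    ["mo", "ka", "ta"],
    ["ku", "ka", "da"],
]

fine_inventory = build_inventory(WORDS, FINE_CLUSTERS)
coarse_inventory = build_inventory(WORDS, COARSE_CLUSTERS)
print(len(fine_inventory), len(coarse_inventory))
```

In this toy setting the syllable "ka" preceded by "o" and followed by "t" shares a unit with "ka" preceded by "u" and followed by "d" under the coarse clusters, while the fine clusters keep them apart. Choosing the fineness of the cluster set therefore trades inventory size against how precisely the context is modeled.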

Takashi Saito | Yasuhide Hashimoto | Masaharu Sakamoto

[1] Shin'ya Nakajima. English speech synthesis based on multi-layered context oriented clustering; towards multi-lingual speech synthesis, 1993, EUROSPEECH.

[2] Eric Moulines, et al. A diphone synthesis system based on time-domain prosodic modifications of speech, 1989, International Conference on Acoustics, Speech, and Signal Processing.

[3] Shinya Nakajima. Automatic synthesis unit generation for English speech synthesis based on multi-layered context oriented clustering, 1994, Speech Commun.