High-quality speech synthesis using context-dependent syllabic units

We propose a new method for constructing a context-dependent Japanese syllabic unit inventory that incorporates the spectral influence of left- and right-hand neighboring phonemes on CV (consonant-vowel) syllabic units. The inventory is generated using a set of phonemic clusters, defined in this work, that approximate the average spectral behavior of phonemes in triphone contexts. Unit inventories of multiple sizes are generated for a waveform-concatenation-based text-to-speech system. Inventory sizes range from 322 to 3892 units, depending on the fineness of the phonemic cluster set used. Synthetic speech generated with the largest inventory is highly natural and intelligible. We also discuss the scalability of the proposed unit inventories.
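As an illustration of how the fineness of the cluster set can control inventory size, the following is a minimal sketch, not the construction used in this work. The cluster maps `COARSE` and `FINE`, the helper names, and the toy context list are all hypothetical; the actual phonemic clusters and the procedure for deriving them are not reproduced here. Each CV syllable is keyed by the cluster of its left neighbor, the syllable itself, and the cluster of its right neighbor, so contexts whose neighbors fall into the same clusters share one unit.

```python
# Hypothetical phoneme-to-cluster maps ("#" marks silence or an utterance boundary).
# These are illustrative assumptions, not the cluster sets defined in the paper.
COARSE = {"a": "V", "i": "V", "u": "V", "k": "C", "s": "C", "t": "C", "#": "#"}
FINE = {
    "a": "V_open", "i": "V_front", "u": "V_back",
    "k": "C_stop", "t": "C_stop", "s": "C_fric",
    "#": "#",
}


def unit_key(left, syllable, right, clusters):
    """Map a (left phoneme, CV syllable, right phoneme) context to a unit key."""
    return (clusters[left], syllable, clusters[right])


def build_inventory(contexts, clusters):
    """Collect the distinct unit keys needed to cover the given contexts."""
    return {unit_key(left, syl, right, clusters) for left, syl, right in contexts}


# Toy contexts: (left neighboring phoneme, CV syllable, right neighboring phoneme).
CONTEXTS = [
    ("#", "ka", "s"),
    ("a", "ka", "t"),
    ("i", "ka", "s"),
    ("u", "ka", "k"),
    ("a", "sa", "k"),
    ("i", "sa", "#"),
]

coarse_inventory = build_inventory(CONTEXTS, COARSE)
fine_inventory = build_inventory(CONTEXTS, FINE)
print(len(coarse_inventory), len(fine_inventory))
```

In this toy setting the coarser map merges more contexts into shared units, so it yields the smaller inventory; a finer map keeps more contexts distinct at the cost of a larger inventory.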