A hybrid approach to synthesize high quality Cantonese speech

Synthesizing high-quality speech requires an intelligent modification algorithm that adjusts the important prosodic features of the pre-stored speech units to meet the desired output requirements, such as smoothness, naturalness and pleasantness. The time-domain pitch-synchronous overlap-and-add (TD-PSOLA) scheme is a simple but effective method of modifying the pitch and time scale of speech, and it can produce high-quality synthetic output. However, when the prosodic pattern requires a drastic modification of the spectral content of the stored units, TD-PSOLA often produces reverberant-sounding speech. This paper develops a new hybrid synthesis method, based on TD-PSOLA and a shape-invariant sinusoidal technique, to alleviate the problem of reverberation. It is particularly useful for generating Cantonese speech, since it can cope with the rapid changes in the pitch profile of Cantonese, which is a monosyllabic and tonal language. The proposed method has been applied to construct a Cantonese synthesizer, which is shown to be capable of producing high-quality Cantonese speech without reverberation.
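For readers unfamiliar with TD-PSOLA, the following is a minimal sketch of the basic overlap-and-add step only, not the hybrid method proposed in the paper (the shape-invariant sinusoidal part is not shown). It assumes that pitch marks are already available as sample indices; the function name `td_psola` and the parameters `pitch_factor` and `time_factor` are hypothetical names chosen for illustration.

```python
import numpy as np

def td_psola(signal, pitch_marks, pitch_factor=1.0, time_factor=1.0):
    """Illustrative TD-PSOLA: re-space Hann-windowed, two-period grains.

    Assumption: `pitch_marks` holds increasing sample indices, one per
    pitch period. A real system would estimate them from the speech.
    """
    x = np.asarray(signal, dtype=float)
    marks = np.asarray(pitch_marks, dtype=int)
    if len(marks) < 2:
        raise ValueError("need at least two pitch marks")

    out_len = int(round(len(x) * time_factor))
    out = np.zeros(out_len)
    norm = np.zeros(out_len)  # summed window weights, for gain correction

    t_out = marks[0] * time_factor  # first synthesis pitch mark
    while t_out < out_len:
        # Pick the analysis mark closest to this synthesis instant.
        i = int(np.argmin(np.abs(marks - t_out / time_factor)))
        j = min(i, len(marks) - 2)
        period = int(marks[j + 1] - marks[j])  # local pitch period

        lo, hi = marks[i] - period, marks[i] + period + 1
        if lo >= 0 and hi <= len(x):  # skip grains that run off the input
            window = np.hanning(2 * period + 1)
            grain = x[lo:hi] * window
            start = int(round(t_out)) - period
            a, b = max(start, 0), min(start + len(grain), out_len)
            if b > a:
                out[a:b] += grain[a - start:b - start]
                norm[a:b] += window[a - start:b - start]

        # Synthesis marks are spaced by the modified pitch period.
        t_out += period / pitch_factor

    covered = norm > 1e-8
    out[covered] /= norm[covered]
    return out

# Usage on a synthetic periodic signal with an 80-sample period.
n = np.arange(1600)
x = np.sin(2 * np.pi * n / 80) + 0.5 * np.sin(4 * np.pi * n / 80)
marks = np.arange(80, 1521, 80)
same = td_psola(x, marks)                      # no modification
higher = td_psola(x, marks, pitch_factor=2.0)  # pitch raised one octave
longer = td_psola(x, marks, time_factor=1.5)   # duration stretched
```

With both factors at 1.0 the grains are placed back where they came from, so the interior of the signal is reproduced. With `pitch_factor=2.0` the same grains are placed twice as densely, which halves the period of the output. Large factors like this are the regime in which the abstract reports audible reverberation.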
