Improving naturalness in text-to-speech synthesis using natural glottal source

Various methods for improving the naturalness of text-to-speech synthesis and its ability to model individual speakers are discussed. The methods described use a natural glottal source, extracted from natural speech by an inverse-filtering technique. One method uses a repeating loop; another creates a source waveform of the desired pitch by concatenating single pulses. A multisource method that combines different types of glottal source through cross-fading is also proposed. Perceptual listening tests were performed with synthetic stimuli, and the preliminary results show that these methods have the potential to improve the quality of text-to-speech synthesis.
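The pulse-concatenation and cross-fading ideas can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the raised-cosine `make_pulse` is a hypothetical stand-in for a pulse obtained by inverse filtering, each period is simply zero-padded or truncated to the target length, and the cross-fade is linear. The function names, the 16 kHz sample rate, and the labels `modal` and `breathy` are likewise illustrative and do not come from the paper.

```python
import math

def make_pulse(length):
    """Hypothetical stand-in for one inverse-filtered glottal pulse
    (a raised-cosine bump); a real system would use extracted data."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length))
            for n in range(length)]

def concatenate_pulses(pulse, f0_hz, sample_rate, n_periods):
    """Build a source at the desired pitch by repeating a single pulse.
    Assumption: each period is zero-padded or truncated to fit."""
    period = round(sample_rate / f0_hz)
    one_period = (pulse + [0.0] * period)[:period]
    return one_period * n_periods

def cross_fade(source_a, source_b):
    """Linear cross-fade from source_a to source_b over their common
    length (an assumed weighting; the paper's scheme is not specified)."""
    n = min(len(source_a), len(source_b))
    out = []
    for i in range(n):
        w = i / (n - 1) if n > 1 else 1.0
        out.append((1.0 - w) * source_a[i] + w * source_b[i])
    return out

sample_rate = 16000  # assumed sample rate in Hz
# Two illustrative source types at 100 Hz, differing only in pulse width.
modal = concatenate_pulses(make_pulse(80), 100.0, sample_rate, 5)
breathy = concatenate_pulses(make_pulse(140), 100.0, sample_rate, 5)
mixed = cross_fade(modal, breathy)
```

In this sketch the pitch is set only by the period length, so the pulse shape itself is unchanged as the pitch varies, and the cross-fade moves gradually from one source type to the other across the segment.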
