Analysis of the degradation of French vowels induced by the TD-PSOLA algorithm, in text-to-speech context

In concatenative speech synthesis systems, synthetic speech is obtained by concatenating acoustic units selected from a database of natural speech. The duration and fundamental frequency (F0) of the selected units usually differ from those requested by the prosodic model, so the units must be prosodically modified to reach the desired target. TD-PSOLA is an effective and widely used prosodic modification algorithm, but it can degrade the perceived quality of the synthetic speech signal. This paper evaluates the degradation of French vowels and determines the influence of several parameters through an analysis of variance. The results show that the vowels fall into two groups according to their first formant frequency (F1). Finally, a modification cost function representative of the degradation is derived from the investigation.
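
To make the notion of a modification cost concrete, the sketch below computes the pitch and duration factors needed to move a selected unit onto its prosodic target, then combines them into a single penalty. It is an illustration only, not the cost function derived in the paper: the `Prosody` type, the log-ratio form, and the weights `w_pitch` and `w_duration` are all assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Prosody:
    """Prosodic description of a unit (hypothetical type for this sketch)."""
    f0_hz: float       # mean fundamental frequency, in Hz
    duration_s: float  # duration, in seconds


def modification_factors(unit: Prosody, target: Prosody) -> tuple[float, float]:
    """Return (pitch_factor, duration_factor) that map the unit onto the target."""
    return target.f0_hz / unit.f0_hz, target.duration_s / unit.duration_s


def modification_cost(unit: Prosody, target: Prosody,
                      w_pitch: float = 1.0, w_duration: float = 1.0) -> float:
    """Assumed cost: weighted absolute log2 of each modification factor.

    A factor of 1 (no modification) costs 0; doubling and halving cost the same.
    The weights are placeholders, not values taken from the paper.
    """
    pitch_factor, duration_factor = modification_factors(unit, target)
    return (w_pitch * abs(math.log2(pitch_factor))
            + w_duration * abs(math.log2(duration_factor)))
```

Since the results separate the vowels into two groups by F1, a unit selection system could plausibly use a different pair of weights for each group. The abstract gives no such values, so they are left as parameters here.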
