A hybrid model for text-to-speech synthesis

This paper describes a hybrid model developed for high-quality, concatenation-based text-to-speech synthesis. The speech signal undergoes a pitch-synchronous analysis and is decomposed into a harmonic component, with a variable maximum frequency, plus a noise component. The harmonic component is modeled as a sum of sinusoids whose frequencies are multiples of the pitch. The noise component is modeled as a random excitation applied to an LPC filter. In unvoiced segments, the harmonic component is set to zero. When the pitch is modified, a new set of harmonic parameters is obtained by resampling the spectral envelope at the new harmonic frequencies. To synthesize the harmonic component under duration and/or pitch modifications, a phase correction is introduced into the harmonic parameters. The harmonic component is synthesized with the sinusoidal model, and the noise component with the LPC model combined with an overlap-add procedure. This hybrid model enables independent and continuous control of the duration and pitch of the synthesized speech. Comparative evaluation tests carried out in a text-to-speech environment have shown that the hybrid model performs better than the time-domain pitch-synchronous overlap-add (TD-PSOLA) model.
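
The pitch-modification step described above, resampling the spectral envelope at the new harmonic frequencies and then summing sinusoids at multiples of the pitch, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, linear interpolation of the envelope is an assumption (the abstract does not say how the envelope is represented or interpolated), and the phase correction and the LPC noise component are left out.

```python
import math


def interp(x, xs, ys):
    """Linear interpolation of (xs, ys) at x, clamped at both ends.

    Assumption: the envelope is known only at discrete frequencies xs
    (sorted ascending) and is piecewise linear in between.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def resample_envelope(freqs, amps, f0_new, f_max):
    """Sample the envelope at the harmonics k * f0_new that do not exceed f_max.

    f_max plays the role of the variable maximum frequency of the
    harmonic component.
    """
    if f0_new <= 0:
        raise ValueError("f0_new must be positive")
    new_amps = []
    k = 1
    while k * f0_new <= f_max:
        new_amps.append(interp(k * f0_new, freqs, amps))
        k += 1
    return new_amps


def synth_harmonic(amps, phases, f0, fs, n):
    """Sum of sinusoids at multiples of f0 (Hz), n samples at rate fs (Hz)."""
    out = []
    for t in range(n):
        sample = 0.0
        for k, (a, p) in enumerate(zip(amps, phases), start=1):
            sample += a * math.cos(2.0 * math.pi * k * f0 * t / fs + p)
        out.append(sample)
    return out


# Envelope measured at the harmonics of a 100 Hz pitch (illustrative values).
orig_freqs = [100.0, 200.0, 300.0, 400.0]
orig_amps = [1.0, 0.5, 0.25, 0.125]

# Raise the pitch to 150 Hz; harmonics above 400 Hz are dropped.
new_amps = resample_envelope(orig_freqs, orig_amps, f0_new=150.0, f_max=400.0)
frame = synth_harmonic(new_amps, [0.0] * len(new_amps), f0=150.0, fs=8000.0, n=80)
```

Because the amplitudes are read off the envelope rather than copied from the original harmonics, the spectral shape is preserved while the harmonic spacing changes. In unvoiced segments the amplitude list would simply be empty, which makes the harmonic output zero.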
