Flexible harmonic/stochastic speech synthesis

This paper presents a flexible harmonic/stochastic waveform generator for speech synthesis systems. Speech is modeled as the superposition of two components: a harmonic component and a stochastic, or aperiodic, component. The purpose of this representation is to provide a framework with maximum flexibility for all kinds of speech transformations. In contrast to similar systems in the literature, such as HNM, our system can operate at a constant frame rate instead of using a pitch-synchronous scheme. The analysis process is thus simplified, while phase coherence is guaranteed by new prosodic modification and concatenation procedures designed for this scheme. Since the system was created for voice conversion applications, in this work we validate, as a preliminary step, its performance in a speech synthesis context by comparing it with the well-known TD-PSOLA technique, using four different voices and different synthesis database sizes. Listeners' opinions indicate that the described methods and algorithms are preferred over TD-PSOLA, and thus are suitable for high-quality speech synthesis and for further voice transformations.
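
The two-component model can be illustrated with a minimal sketch that builds one fixed-length frame as a sum of harmonics of the fundamental plus a noise term. This is not the authors' implementation: the function name `synthesize_frame`, its parameters, and the use of plain white Gaussian noise for the stochastic component are assumptions made only for illustration, and the prosodic modification and concatenation procedures are not shown.

```python
import math
import random


def synthesize_frame(f0, amplitudes, phases, noise_gain, n_samples, fs, seed=0):
    """Return one frame of a harmonic-plus-noise signal (illustrative only).

    f0          fundamental frequency in Hz
    amplitudes  amplitude of each harmonic (index 0 is the fundamental)
    phases      phase of each harmonic in radians
    noise_gain  scale of the stochastic component (assumed white Gaussian)
    n_samples   frame length in samples (constant, not pitch-synchronous)
    fs          sampling rate in Hz
    """
    rng = random.Random(seed)
    nyquist = fs / 2.0
    frame = []
    for n in range(n_samples):
        t = n / fs
        # Harmonic component: sinusoids at integer multiples of f0,
        # skipping any harmonic at or above the Nyquist frequency.
        harmonic = sum(
            a * math.cos(2.0 * math.pi * (k + 1) * f0 * t + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases))
            if (k + 1) * f0 < nyquist
        )
        # Stochastic component: white noise stands in for the aperiodic part.
        stochastic = noise_gain * rng.gauss(0.0, 1.0)
        frame.append(harmonic + stochastic)
    return frame


# Example: a 10 ms frame at 8 kHz with three harmonics of 100 Hz.
frame = synthesize_frame(
    f0=100.0,
    amplitudes=[1.0, 0.5, 0.25],
    phases=[0.0, 0.0, 0.0],
    noise_gain=0.05,
    n_samples=80,
    fs=8000,
)
```

Because the frame length here is fixed rather than tied to the pitch period, the harmonic phases must be tracked explicitly from frame to frame; the sketch simply takes them as inputs.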
