Paper information - On the use of a hybrid harmonic/stochastic model for TTS synthesis-by-concatenation

Abstract: In this paper, we address the possibilities offered by hybrid harmonic/stochastic (H/S) models in the context of wide-band text-to-speech synthesis based on segment concatenation. After a brief reminder of the hypotheses underlying such models and a comprehensive review of a well-known analysis algorithm, namely the one provided by the multi-band excitation (MBE) analysis framework, we study how H/S models allow the prosody of segments to be modified and how segment concatenation can be organized so as to minimize mismatches at segment borders. In this context, we introduce an original concatenation algorithm which takes advantage of some analysis errors. Speech synthesis algorithms are then described, including an original synthesis technique based on judiciously prepared IFFTs, and the final segmental quality is detailed. More particularly, we examine the differences in the quality obtained when using the model in a narrow-band speech coding context and in a wide-band, concatenation-based synthesis context. We study three possible causes for these differences: the choice of an analysis criterion, the inadequacy of the model due to pitch variations, and the effect of coarticulation on phases.

Thierry Dutoit | Bernard Gosselin

[1] Thomas F. Quatieri, et al. Magnitude-only reconstruction using a sinusoidal speech model, 1984, ICASSP.

[2] S. Mallat. Multiresolution approximations and wavelet orthonormal bases of L^2(R), 1989.

[3] Thomas F. Quatieri, et al. Sine-wave phase coding at low data rates, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[4] Eric Moulines, et al. HNS: Speech modification based on a harmonic+noise model, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5] Isabel Trancoso, et al. Hybrid sinusoidal modeling of speech without voicing decision, 1991, EUROSPEECH.

[6] Douglas L. Jones, et al. Real-valued fast Fourier transform algorithms, 1987, IEEE Trans. Acoust. Speech Signal Process.

[7] Andrew Perkis, et al. A multiband excitation linear predictive speech coder, 1991, EUROSPEECH.

[8] Luís B. Almeida, et al. Variable-frequency synthesis: An improved harmonic coding scheme, 1984, ICASSP.

[9] O. Fujimura. An approximation to voice aperiodicity, 1968.

[10] John C. Hardwick, et al. A 4.8 kbps multi-band excitation speech coder, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[11] Thierry Dutoit, et al. An analysis of the performances of the MBE model when used in the context of a text-to-speech system, 1993, EUROSPEECH.

[12] Julius O. Smith, et al. Spectral modeling synthesis: A sound analysis/synthesis based on a deterministic plus stochastic decomposition, 1990.

[13] Jae S. Lim, et al. Multiband excitation vocoder, 1988, IEEE Transactions on Acoustics, Speech, and Signal Processing.

[14] Olivier Boëffard, et al. Improving the robustness of text-to-speech synthesizers for large prosodic variations, 1994, SSW.

[15] Bernd W. Kolpatzik, et al. Speech coding using nonstationary sinusoidal modelling and narrow-band basis functions, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[16] Luís B. Almeida, et al. Frequency-varying sinusoidal modeling of speech, 1989, IEEE Trans. Acoust. Speech Signal Process.

[17] Thomas F. Quatieri, et al. Speech analysis/synthesis based on a sinusoidal representation, 1986, IEEE Trans. Acoust. Speech Signal Process.

[18] Carmen García-Mateo, et al. A text-to-speech system for Spanish with a frequency domain based prosodic modification algorithm, 1993, ICASSP.

[19] F. Harris. On the use of windows for harmonic analysis with the discrete Fourier transform, 1978, Proceedings of the IEEE.

[20] Eric Moulines, et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, 1989, Speech Commun.