Paper information - Removing phase mismatches in concatenative speech synthesis

Removing phase mismatches in concatenative speech synthesis

Concatenation of acoustic units is widely used in most of the currently available text-to-speech systems. While this approach leads to higher intelligibility and naturalness than synthesis-by-rule, it has to cope with the issues of concatenating acoustic units that have been recorded in a different order. One important issue in concatenation is that of synchronization of speech frames or, in other words, inter-frame coherence. This paper presents a novel method for synchronization of signals with applications to speech synthesis. The method is based on the notion of center of gravity applied to speech signals. It is an off-line approach as this can be done during analysis with no computational burden on synthesis. The method has been tested with the Harmonic plus Noise Model, HNM, on many large speech databases. The resulting synthetic speech is free of phase mismatch (inter-frame incoherence) problems.

Yannis Stylianou

[1] Andreas Spanias, et al. A new phase model for sinusoidal transform coding of speech, 1998, IEEE Trans. Speech Audio Process.

[2] Eric Moulines, et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, 1989, Speech Commun.

[3] Eric Moulines, et al. HNS: Speech modification based on a harmonic+noise model, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Thomas F. Quatieri, et al. Speech analysis/synthesis based on a sinusoidal representation, 1986, IEEE Trans. Acoust. Speech Signal Process.

[5] R.N. Bracewell, et al. Signal analysis, 1978, Proceedings of the IEEE.

[6] Dik J. Hermes, et al. Synthesis of breathy vowels: Some research methods, 1991, Speech Commun.

[7] Olivier Boëffard, et al. Improving the robustness of text-to-speech synthesizers for large prosodic variations, 1994, SSW.

[8] Yannis Stylianou, et al. Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification, 1996.

[9] Thierry Dutoit, et al. Diphone concatenation using a harmonic plus noise model of speech, 1997, EUROSPEECH.

[10] Miguel Ángel Rodríguez Crespo, et al. On the Use of a Sinusoidal Model for Speech Synthesis in Text-to-Speech, 1997.

[11] Eric Moulines, et al. High-quality speech modification based on a harmonic + noise model, 1995, EUROSPEECH.