Paper information - Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling

In this paper we present a method for speech modeling and its use in IBM’s small footprint concatenative text-to-speech system. The method is based on frequency-domain, complex spectral envelope modeling, where the phase component plays a crucial role in attaining high-quality speech synthesis. The modeling scheme enables low bit rate compression of the amplitude and phase information and low-complexity reconstruction of high-quality speech with wide-range pitch modification. Listening tests on the overall text-to-speech system show a major improvement in mean opinion score (MOS) over a previous MFCC-based system.
