Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling

In this paper we present a method for speech modeling and its use in IBM’s small-footprint concatenative text-to-speech system. The method is based on frequency-domain complex spectral envelope modeling, in which the phase component plays a crucial role in attaining high-quality speech synthesis. The modeling scheme enables low-bit-rate compression of the amplitude and phase information and low-complexity reconstruction of high-quality speech with wide-range pitch modification. Listening tests conducted on the overall text-to-speech system show a major improvement in mean opinion score (MOS) compared to a previous MFCC-based system.