Choose the best to modify the least: a new generation concatenative synthesis system

This paper describes a corpus-based approach applied in the evolution of ELOQUENS, the CSELT text-to-speech synthesis system for Italian, towards multi-voice, multi-language, high-naturalness concatenative synthesis. The acoustic modules have been redesigned around the idea of reducing both the number of junctions and the need for prosodic modification. Appropriate phonetic coverage methods were applied in the design of the acoustic database. Automatic processing tools performed phone and diphone segmentation, pitch marking, and prosodic feature detection. The synthesis algorithm makes the best use of the available speech material by searching the database for the longest suitable sequences, according to weighted distance measures on phonetic and prosodic parameters. Signal modification techniques are applied only when necessary, to smooth residual prosodic jumps at unit boundaries. The resulting voice is quite human-sounding.

Keyword: corpus-based concatenative synthesis
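
The selection idea summarized in the abstract (prefer the longest database sequence that matches the target, and rank candidates by a weighted phonetic/prosodic distance) can be illustrated with a minimal sketch. It is not the ELOQUENS implementation: the greedy left-to-right search, the `Unit` fields, the weight values, and all function names are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One phone with two prosodic features (hypothetical representation)."""
    phone: str
    pitch_hz: float
    duration_ms: float


# Hypothetical weights; the abstract does not give the actual values.
WEIGHTS = {"pitch_hz": 1.0, "duration_ms": 0.5}


def prosodic_distance(target, candidate, weights=WEIGHTS):
    """Mean weighted absolute difference over two equal-length sequences."""
    total = 0.0
    for t, c in zip(target, candidate):
        total += weights["pitch_hz"] * abs(t.pitch_hz - c.pitch_hz)
        total += weights["duration_ms"] * abs(t.duration_ms - c.duration_ms)
    return total / len(target)


def longest_match(target, start, database):
    """Find the longest database run whose phones match target[start:].

    Ties in length are broken by the lowest prosodic distance.
    Returns (length, distance, (utterance_index, position)).
    """
    best_len, best_dist, best_loc = 0, float("inf"), None
    for u_idx, utt in enumerate(database):
        for pos in range(len(utt)):
            n = 0
            while (start + n < len(target) and pos + n < len(utt)
                   and utt[pos + n].phone == target[start + n].phone):
                n += 1
            if n == 0:
                continue
            dist = prosodic_distance(target[start:start + n], utt[pos:pos + n])
            if n > best_len or (n == best_len and dist < best_dist):
                best_len, best_dist, best_loc = n, dist, (u_idx, pos)
    return best_len, best_dist, best_loc


def select_units(target, database):
    """Greedily cover the target with the longest matching sequences.

    Returns a list of (utterance_index, position, length) triples;
    each boundary between consecutive triples is one junction.
    """
    selection = []
    i = 0
    while i < len(target):
        n, _, loc = longest_match(target, i, database)
        if n == 0:
            raise ValueError(f"phone {target[i].phone!r} not covered by database")
        selection.append((loc[0], loc[1], n))
        i += n
    return selection


# Toy database of three recorded utterances (invented data).
database = [
    [Unit("k", 120, 80), Unit("a", 120, 90), Unit("z", 110, 70)],
    [Unit("z", 115, 70), Unit("a", 100, 100)],
    [Unit("a", 180, 60)],
]
# Target phone sequence with the prosody requested by the front end.
target = [Unit("k", 125, 80), Unit("a", 125, 90),
          Unit("z", 112, 70), Unit("a", 105, 95)]

selection = select_units(target, database)
```

In this toy case the first three target phones are taken as one stretch from the first utterance, and the final phone is taken from the second utterance because its pitch and duration are closest to the target, so the four phones are synthesized with a single junction. A greedy search is the simplest way to show the "longest sequence first" preference; a real system could equally use a dynamic-programming search over all candidates.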