Corpus-based speech synthesis: Methods and challenges

Corpus-based approaches to speech synthesis have been advocated to overcome the limitations of concatenative synthesis from a fixed acoustic unit inventory. The high frequency of unit concatenations in, e.g., diphone synthesis has been argued to contribute to the perceived lack of naturalness of synthetic speech. The key idea of corpus-based synthesis, or unit selection, is to use an entire speech corpus as the acoustic inventory and to select from this corpus at run-time the longest available strings of phonetic segments that match a sequence of target speech sounds in the utterance to be synthesized, thereby minimizing the number of concatenations and reducing the need for signal processing. This paper reviews the assumptions underlying this synthesis strategy and the different approaches to unit selection, as well as the major challenges encountered by corpus-based methods. One of the biggest open problems to date is the relative weighting of the acoustic distance measures used in selection. We further argue against the quest for ever larger speech databases with optimal coverage of the target domain, which is often the language as a whole. We also show that word- or syllable-based approaches are only feasible in strictly closed application domains.
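For concreteness, a common cost formulation of unit selection (in the spirit of Hunt and Black's approach, and not necessarily identical to the formulations reviewed here) can be sketched as follows; the weights w^t_j and w^c_k correspond to the relative weighting of acoustic distance measures mentioned above, and the individual sub-costs are implementation-dependent assumptions:

  C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i),

  C^t(t_i, u_i) = \sum_{j} w^t_j \, C^t_j(t_i, u_i), \qquad
  C^c(u_{i-1}, u_i) = \sum_{k} w^c_k \, C^c_k(u_{i-1}, u_i),

where t_1^n is the sequence of target specifications, u_1^n a candidate sequence of corpus units, C^t a target cost measuring the mismatch between a candidate unit and its target, and C^c a concatenation (join) cost that is typically zero for units adjacent in the corpus, which is what favours long contiguous strings. The unit sequence minimizing C is usually found by a Viterbi-style dynamic programming search over the candidate lattice; the difficulty highlighted in this paper lies in choosing the weights w^t_j and w^c_k.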