Domain‐specific prominence‐based concatenation

Some recent approaches to unlimited text‐to‐speech conversion do not use “designed” inventories of concatenative units like diphones. Instead, an annotated corpus of read speech is searched for the realization of a speech segment whose features best match the corresponding ones demanded by the synthesis input. Crucial for this approach are the size and the variety of the corpus, the number and kind of annotations, and the definition of “best match” based on these annotations. This implementation relies on “perceived prominence” as the most important parameter for the selection of a prosodically appropriate realization. A previous investigation with a German corpus indicated, however, that a corpus has to be very large in order to allow prosodic variations (e.g., different focus placements) of one utterance while retaining acceptable intelligibility, if no post‐selection signal modification is applied. The application of the same procedure to an American English corpus with do...
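
The corpus search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` schema, the normalized 0-to-1 prominence scale, the `position` annotation, and the weights are all assumptions chosen only to show how a "best match" definition might let perceived prominence dominate the selection.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Unit:
    """One annotated realization of a speech segment (hypothetical schema)."""
    phone: str         # segment label
    prominence: float  # perceived prominence, assumed normalized to 0..1
    position: str      # assumed annotation: "initial", "medial" or "final"
    source: str        # identifier of the corpus utterance


# Assumed weights: prominence mismatch counts most, position mismatch less.
WEIGHTS = {"prominence": 1.0, "position": 0.25}


def match_cost(target: Unit, candidate: Unit) -> float:
    """Weighted distance between the demanded and the annotated features."""
    prominence_term = abs(target.prominence - candidate.prominence)
    position_term = 0.0 if target.position == candidate.position else 1.0
    return (WEIGHTS["prominence"] * prominence_term
            + WEIGHTS["position"] * position_term)


def select_unit(target: Unit, corpus: Iterable[Unit]) -> Optional[Unit]:
    """Return the corpus realization with the lowest match cost.

    Only realizations of the same segment are considered; None is returned
    when the corpus contains no realization of the requested segment.
    """
    candidates = [u for u in corpus if u.phone == target.phone]
    if not candidates:
        return None
    return min(candidates, key=lambda u: match_cost(target, u))
```

With these assumed weights, a candidate whose prominence is close to the target can win even when its position annotation differs. This also shows why corpus size matters: when no close realization exists, the nearest available one may still be a poor prosodic match.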