Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis

Prosody is an important factor in the quality of text-tospeech (TTS) synthesis. Typically, acoustic parameters such as f0 and duration are the only variables related to prosody that are used to determine unit selection. Our study explored adding the explicit use of linguistically and perceptually motivated prosodic categories in unit selection-based TTS. One of our goals was to automate the process of prosodically labeling our TTS inventory. However, reliability among labelers for some ToBI[6] (Tones and Break Indices) categories was too low[9] for successful training of an automatic prosody recognizer. We developed a prosody labeling system simpler and more robust than standard EToBI (English ToBI). This \ToBI Lite" system was used successfully for automatic labeling of the acoustic inventory and in prosodically enriched unit selection. A formal listening test was conducted to compare subjective quality ratings for several variations of the AT&T unit selection concatenative TTS system that di ered only in their method of prosodic labeling of the inventory or their use of prosody for unit selection. The use of simple prosodic categories in unit selection signi cantly improved ratings, and automatic prosodic labeling resulted in higher ratings than manual labeling.