Text-To-Speech Intelligibility Across Speech Rates

A web-based listening test measured the intelligibility, across speech rates, of 8 TTS systems and of a linearly time-compressed human speech reference voice. The synthesis systems included 2 independent representatives of each of the following 4 synthesis methods: formant, diphone concatenation, unit selection concatenation, and HMM. For each TTS system, a female and a male American English voice were tested. Semantically unpredictable sentences were presented at 6 speech rates from 200 to 450 words per minute. In an open response format, listeners typed what they heard. Listener transcriptions were automatically scored at the word level, and a normalized edit distance per speech rate was calculated for each of 355 listeners. There were significant differences among the TTS systems. The 2 unit selection TTS systems were the most intelligible across speech rates; one was equivalent to human speech. Listeners' native language, TTS familiarity, and audio equipment were also significant factors.
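The word-level scoring described above can be illustrated with a minimal sketch. It is not the study's actual scoring code: the tokenization (lowercasing, dropping punctuation), the normalization by reference length, the cap at 1.0, and all function names are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists; each substitution,
    insertion, or deletion costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(reference, transcription):
    """Edit distance divided by reference length, capped at 1.0 (assumed convention).
    0.0 means a perfect transcription; 1.0 means nothing was recovered."""
    ref = tokenize(reference)
    hyp = tokenize(transcription)
    if not ref:
        return 0.0 if not hyp else 1.0
    return min(1.0, word_edit_distance(ref, hyp) / len(ref))


def mean_score_per_rate(responses):
    """Average one listener's scores per speech rate.
    `responses` is an iterable of (rate_wpm, reference, transcription) tuples."""
    by_rate = {}
    for rate, ref, hyp in responses:
        by_rate.setdefault(rate, []).append(normalized_edit_distance(ref, hyp))
    return {rate: sum(scores) / len(scores) for rate, scores in by_rate.items()}
```

For a hypothetical 7-word sentence in which a listener mishears 2 words, this sketch yields a score of 2/7; grouping such scores by rate gives one value per listener per speech rate, the quantity the abstract describes.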