Abstract

An intelligibility test was run to assess various text-to-speech synthesizers. Twenty semantically unpredictable sentences were generated in each of the five selected syntactic structures. These basic structures were defined as a crosslinguistic methodology for corpus generation in a European environment. Results of the French test are presented here. They show that the "SAM" methodology is efficient for the assessment of TTS systems, as it allows comparisons of prosodic, coding, semantic and feed-forward factors between synthesizers. Responses from twenty listeners during five sessions are also analysed. The distribution shows a strong relationship between the proportion of correct sentences (ps) and the proportion of correct words (pw). The ratio r = log(ps)/log(pw) appears to be a powerful index for measuring the complexity of a spoken message. Data replotted from the literature confirm the hypothesis that the higher the contextual (semantic, syntactic, etc.) information in a sentence, the lower this index r.

A sentence can be considered a sequence of more or less linguistically related symbolic units (phonemes, syllables, words, etc.), but the comprehension of a message by listeners depends on an unknown number of subjective units, which Miller called "decision units in the perception of speech", and which result from various bottom-up and top-down strategies of identification and verification at the acoustic-phonetic level. In this perspective, the index r could be related to the number of decision units listeners must deal with when listening to a sentence.

Speech synthesizers distort the comprehension of sentences. The distribution of omissions and mistakes does not obey the binomial law that would be expected from a simple model in which all input units have the same independent probability of being correctly identified. Analysis of the discrepancy between the experimental distribution of word errors and the binomial distribution derived from this simple model shows that the linguistic relations between words allow both the correction of "theoretically misunderstood" words and the distortion of "theoretically understood" words. This phenomenon depends mainly on the linguistic content of the sentences, which can be quantified by means of the suggested index r. It also shows second-order variations due to other factors such as the subjects' competence and training, or the acoustic degradation of the message.
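The relationship between the index r and the simple binomial model can be sketched numerically. The following minimal Python example uses hypothetical test tallies (all counts are illustrative, not taken from the paper's data): it computes pw, ps and r, and contrasts the observed sentence-correct rate with the ps that the independence model would predict, where r would equal the number of words per sentence.

```python
import math

# Hypothetical intelligibility-test tallies (illustrative values only,
# not the paper's actual data).
words_per_sentence = 7        # fixed sentence length in the test corpus
n_sentences = 100
correct_words = 560           # out of 700 presented words
correct_sentences = 45        # sentences with every word transcribed correctly

pw = correct_words / (n_sentences * words_per_sentence)  # proportion of correct words
ps = correct_sentences / n_sentences                     # proportion of correct sentences

# Complexity index suggested in the abstract: r = log(ps) / log(pw),
# interpretable as an effective number of independent decision units.
r = math.log(ps) / math.log(pw)

# Simple binomial model: if each of the k input units were identified
# independently with probability pw, then ps = pw**k, i.e. r = k.
ps_binomial = pw ** words_per_sentence

print(f"pw = {pw:.3f}, ps = {ps:.3f}")
print(f"index r = {r:.2f} (independence model would give r = {words_per_sentence})")
print(f"binomial prediction for ps: {ps_binomial:.3f}")
```

With these illustrative counts, r comes out well below the actual word count per sentence, which is the signature of contextual information: linguistic relations between words "repair" some theoretically misidentified words, so whole sentences are recovered more often than the independence model predicts.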
[1] George A. Miller et al., "Decision units in the perception of speech," IRE Trans. Inf. Theory, 1962.
[2] Eric Moulines et al., "A diphone synthesis system based on time-domain prosodic modifications of speech," International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1989.
[3] Martine Grice et al., "Multilingual synthesiser assessment using semantically unpredictable sentences," EUROSPEECH, 1989.
[4] G. A. Miller et al., "Some perceptual consequences of linguistic rules," 1963.
[5] G. A. Miller et al., "The intelligibility of speech as a function of the context of the test materials," Journal of Experimental Psychology, 1951.
[6] G. Bailly et al., "Multiparametric generation of French prosody from unrestricted text," ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986.
[7] P. Lieberman, "Some effects of semantic and grammatical context on the production and perception of speech," 1963.