On-line experimental methods to evaluate text-to-speech (TTS) synthesis: effects of voice gender and signal quality on intelligibility, naturalness and preference

Abstract: Three experiments are reported that use new experimental methods for the evaluation of text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete “call centre” word stimuli, investigated the effect of voice gender and signal quality on the intelligibility of three concatenative TTS synthesis systems. Accuracy and search time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. Both voice gender and noise affected intelligibility. The results also indicated interactions among voice gender, signal quality, and TTS synthesis system on accuracy and search time. In Experiment 3 the method of paired comparisons was used to yield ranks of naturalness and preference. As hypothesized, preference and naturalness ranks were influenced by TTS system, signal quality and voice gender, in isolation and in combination. The pattern of results across the four dependent variables – accuracy, search time, naturalness, preference – was consistent. Natural speech surpassed synthetic speech, and TTS system C elicited relatively high scores across all measures. Intelligibility, judged naturalness and preference are modulated by several factors, so systems need to be tailored to particular commercial applications and environmental conditions.
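The method of paired comparisons mentioned for Experiment 3 can be illustrated with a minimal sketch: every pair of stimuli is presented once, the listener picks one member of each pair, and stimuli are ranked by how often they were chosen. This is not the authors' procedure or analysis. The stimulus labels (`s1` to `s4`), the scores and the `prefer` function are hypothetical stand-ins for a listener's judgements, and the sketch ignores the pair-ordering and repeated-listener aspects of a real design.

```python
from collections import Counter
from itertools import combinations


def rank_by_paired_comparisons(items, choose):
    """Rank items by the number of pairs in which each one was chosen."""
    wins = Counter({item: 0 for item in items})
    # Present every unordered pair exactly once.
    for a, b in combinations(items, 2):
        wins[choose(a, b)] += 1
    # Most-chosen first; sorted() is stable, so ties keep their input order.
    ranking = sorted(items, key=lambda item: -wins[item])
    return ranking, dict(wins)


# Hypothetical latent scores standing in for one listener's judgements.
# These values are invented for illustration and are not data from the study.
hypothetical_scores = {"s1": 2, "s2": 4, "s3": 1, "s4": 3}


def prefer(a, b):
    """Return whichever stimulus of the pair the hypothetical listener chooses."""
    return a if hypothetical_scores[a] >= hypothetical_scores[b] else b


ranking, wins = rank_by_paired_comparisons(list(hypothetical_scores), prefer)
print(ranking)
print(wins)
```

With four stimuli there are six pairs, so each stimulus can be chosen between zero and three times; the same counting applies whether the judgement asked for is naturalness or preference.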
