Evaluating comprehension of natural and synthetic conversational speech

Current speech synthesis methods typically operate on isolated sentences and lack convincing prosody when generating longer segments of speech. Similarly, prevailing TTS evaluation paradigms, such as intelligibility (transcription word error rate) or MOS, only score sentences in isolation, even though overall comprehension is arguably more important for speech-based communication. In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, we investigated comprehension of natural and synthetic speech dialogues. Specifically, we tested listener comprehension on long segments of spontaneous and engaging conversational speech (three 10-minute radio interviews of comedians). Interviews were reproduced either as natural speech, synthesised from carefully prepared transcripts, or synthesised using durations from forced alignment against the natural speech, all in a balanced design. Comprehension was measured using multiple-choice questions. A significant difference was found between the comprehension/retention of natural speech (74% correct responses) and synthetic speech with forced-aligned durations (61% correct responses). However, no significant difference was observed between natural and regular synthetic speech (70% correct responses). Effective evaluation of comprehension remains elusive.
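
For readers unfamiliar with the sentence-level intelligibility measure mentioned above, the following is a minimal sketch of transcription word error rate: the word-level edit distance between a reference text and a listener's transcription, divided by the number of reference words. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenisation are assumptions made for illustration; they are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenisation is lowercase + whitespace split,
    which is an assumption and not the procedure used in the study.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in transcription
            insertion = curr[j - 1] + 1  # extra word in transcription
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

A score of this kind is computed per sentence, which is why it cannot by itself capture comprehension of a 10-minute dialogue.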
