Evaluation of TTS Systems in Intelligibility and Comprehension Tasks

This paper investigates the relationship between intelligibility and comprehensibility in speech synthesizers and seeks to design an appropriate comprehension task for evaluating a synthesizer's comprehensibility. It is predicted that a speech synthesizer with higher intelligibility will perform better in comprehension. Because the two most widely used speech synthesis methods are HMM-based synthesis and unit selection, the study also compares the performance of the HTS-2008 (HMM-based) and Multisyn (unit selection) speech synthesizers. Natural speech is included in the experiment as a control condition. The intelligibility test shows that natural speech is more intelligible than HTS-2008, and that HTS-2008 is much more intelligible than Multisyn. In the comprehension task, by contrast, the three systems differ little. This is because both speech synthesizers have reached a threshold of intelligibility sufficient to support high speech comprehension. Therefore, although HTS-2008 and Multisyn are equally comprehensible, HTS-2008 is the recommended system because of its higher intelligibility.
