Are we using enough listeners? No! - An empirically-supported critique of Interspeech 2014 TTS evaluations

Tallying the number of listeners who took part in subjective evaluations of synthetic speech at Interspeech 2014 shows that in more than 60% of papers, conclusions are based on listening tests with fewer than 20 listeners. Our analysis of Blizzard Challenge 2013 data shows that, for a MOS test measuring naturalness, a stable level of significance is only reached once more than 30 listeners are used. In this paper, we set out a list of guidelines, i.e., a checklist, for carrying out meaningful subjective evaluations. We further illustrate the importance of sentence coverage and of the number of listeners by presenting the changes in rank order and in the number of significantly different system pairs that arise when data from the Blizzard Challenge 2013 are re-analysed.
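
To make the re-analysis concrete: the effect of listener count on the number of significantly different system pairs can be probed by repeatedly subsampling listeners and re-running pairwise significance tests. The Python sketch below illustrates this idea under stated assumptions; the data layout (a dict of per-system score matrices), the function name, and the choice of pairwise Wilcoxon signed-rank tests with Bonferroni correction are illustrative assumptions broadly in line with standard Blizzard Challenge analysis practice, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import wilcoxon

def mean_significant_pairs(ratings, n_listeners, alpha=0.05,
                           n_resamples=200, seed=0):
    """Average number of significantly different system pairs when only
    `n_listeners` randomly chosen listeners are kept.

    `ratings` maps each system name to an array of shape
    (num_listeners, num_sentences) of 1-5 naturalness scores, with rows
    aligned across systems (row i is the same listener in every array).
    """
    systems = sorted(ratings)
    pairs = [(a, b) for i, a in enumerate(systems) for b in systems[i + 1:]]
    corrected_alpha = alpha / len(pairs)  # Bonferroni correction over pairs
    total_listeners = next(iter(ratings.values())).shape[0]
    rng = np.random.default_rng(seed)

    counts = []
    for _ in range(n_resamples):
        # Draw the same random subset of listeners for every system.
        idx = rng.choice(total_listeners, size=n_listeners, replace=False)
        n_sig = 0
        for a, b in pairs:
            # Pair the two systems by listener: one mean score per listener.
            x = ratings[a][idx].mean(axis=1)
            y = ratings[b][idx].mean(axis=1)
            if wilcoxon(x, y).pvalue < corrected_alpha:
                n_sig += 1
        counts.append(n_sig)
    return float(np.mean(counts))
```

Sweeping `n_listeners` over a range such as 10 to 40 with a function of this kind yields the sort of curve used to argue that the set of significant differences only stabilises once more than 30 listeners are included.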
