Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content

Advances in text-to-speech (TTS) voices and the rise of commercial TTS platforms allow people to easily experience TTS voices across a variety of technologies, applications, and form factors. We therefore evaluated TTS voices for long-form content, asking not how well a voice renders individual words or sentences but whether it is pleasant to listen to for several minutes at a time. We introduce a method that uses a crowdsourcing platform and an online survey to evaluate voices on listening experience, perceived clarity and quality, and comprehension. We evaluated 18 TTS voices, three human voices, and a text-only control condition. We found that TTS voices are close to rivaling human voices, yet no single voice outperforms the others across all evaluation dimensions. We conclude with considerations for selecting TTS voices for long-form content.
