Yelling at Your TV: An Analysis of Speech Recognition Errors and Subsequent User Behavior on Entertainment Systems

Millions of consumers issue voice queries through television-based entertainment systems such as the Comcast X1, the Amazon Fire TV, and Roku TV. Automatic speech recognition (ASR) systems are responsible for transcribing these voice queries into text to feed downstream natural language understanding modules. However, ASR is far from perfect, often producing incorrect transcriptions and forcing users to take corrective action. To better understand the impact of these errors on sessions, this paper characterizes speech recognition errors as well as subsequent user responses. We provide both quantitative and qualitative analyses, examining the acoustic as well as lexical attributes of the utterances. This work represents, to our knowledge, the first analysis of speech recognition errors from real users on a widely deployed entertainment system.
