The effects of automatic speech recognition quality on human transcription latency

Quickly converting speech to text is fundamental to making aural content accessible to deaf and hard-of-hearing people. Despite its high cost, this task is still performed by human captionists, because automatic speech recognition (ASR) does not perform satisfactorily in real-world settings. Offering ASR output to captionists as a starting point seems like a simpler and more economical alternative, yet the effectiveness of this approach clearly depends on ASR quality: fixing inaccurate ASR output may take longer than producing the transcription from scratch. In this paper, we empirically study how the time captionists need to produce transcriptions from partially correct ASR output varies with the accuracy of that output. Our studies with 160 participants recruited on Amazon's Mechanical Turk indicate that starting from the ASR output is worse unless it is sufficiently accurate, with a Word Error Rate (WER) under 30%.
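
For readers unfamiliar with the metric, the 30% threshold can be grounded in WER's standard definition: the minimum number of word substitutions, insertions, and deletions needed to turn the ASR hypothesis into the reference transcript, divided by the number of reference words. The following is a minimal Python sketch of that computation (illustrative only; the function name is ours, and this is not code from the paper):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: 1 substitution ("sat" -> "sit") and 1 deletion ("the")
# over 6 reference words gives WER = 2/6, i.e. about 33%.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

Under this definition, a WER of 30% means roughly one word in three must be corrected, which marks the point at which the paper finds post-editing ASR output stops paying off relative to typing from scratch.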
