Measuring the acceptable word error rate of machine-generated webcast transcripts

The increased availability of broadband connections has recently led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. One of the hurdles users face when browsing and skimming through such archives is the lack of text transcripts of the audio channel of the webcast. In this paper, we propose a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts at any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a study in which human subjects performed question-answering tasks using archives of webcast lectures, and showed that both their performance and their perception of transcript quality are linearly affected by WER, and that transcripts with a WER of 25% or less would be acceptable for use in webcast archives.

Index Terms: Speech recognition, Wizard of Oz, Prototyping, User interface, Webcast.
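For concreteness, WER as used here is the standard metric: the word-level edit distance (substitutions + deletions + insertions) between a hypothesis transcript and the reference, divided by the number of reference words. A minimal sketch of that computation (illustrative only; function and variable names are our own, not the paper's):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance between the
    hypothesis and the reference, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that drops one word from a six-word reference yields a WER of 1/6 ≈ 16.7%, below the 25% acceptability threshold reported in the abstract.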
