Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription

Deploying an automatic speech recognition system with reasonable performance requires expensive and time-consuming in-domain transcription. Previous work demonstrated that non-professional annotation through Amazon's Mechanical Turk can match professional quality. We use Mechanical Turk to transcribe conversational speech for as little as one thirtieth of the cost of professional transcription. The higher disagreement of non-professional transcribers does not have a significant effect on system performance. While previous work demonstrated that redundant transcription can improve data quality, we found that resources are better spent collecting more data. Finally, we describe a quality control method that does not require professional transcription.
