Quality Assessment of Crowdsourcing Transcriptions for African Languages

We evaluate the quality of speech transcriptions acquired by crowdsourcing for developing ASR acoustic models (AMs) for under-resourced languages. We developed AMs for Swahili and Amharic using both reference (REF) transcriptions and transcriptions obtained by crowdsourcing (TRK). Although the Amharic transcription task took much longer to complete than the Swahili one, the speech recognition systems built from REF and TRK transcriptions have nearly identical word error rates (40.1 vs 39.6 for Amharic and 38.0 vs 38.5 for Swahili). Moreover, the character-level disagreement rates between REF and TRK are only 3.3% and 6.1% for Amharic and Swahili, respectively. We conclude that it is possible to acquire quality transcriptions from the crowd for under-resourced languages using Amazon's Mechanical Turk. Given this potential, we also raise some legal and ethical issues to consider.
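As an illustration of how a disagreement rate between two transcriptions of the same utterance could be computed, the sketch below uses Levenshtein edit distance normalized by the length of the reference. This normalization is an assumption made for illustration; the abstract does not state the exact formula used. The function names and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference units to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement_rate(ref_units, hyp_units):
    """Edit distance divided by reference length (assumed normalization)."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def char_disagreement(ref, trk):
    """Character-level disagreement between a REF and a TRK transcription."""
    return disagreement_rate(list(ref), list(trk))


def word_disagreement(ref, trk):
    """Word-level disagreement, splitting on whitespace."""
    return disagreement_rate(ref.split(), trk.split())


if __name__ == "__main__":
    ref = "habari ya asubuhi"  # made-up reference transcription
    trk = "habari ya asubui"   # made-up crowd transcription
    print(f"character-level: {char_disagreement(ref, trk):.3f}")
    print(f"word-level:      {word_disagreement(ref, trk):.3f}")
```

The same routine applied to word sequences, comparing recognizer output against a reference, gives a word error rate of the kind reported above. Corpus-level figures are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.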