Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CROWDED CORPUS

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, comprising a native-speaker corpus of English (CROWDED_ENGLISH) and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens, collected from 80 speakers, and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language-learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.
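The transcription stage described above collects several crowdworker transcriptions of each recording. One common way to reconcile them is a simple majority vote; the sketch below illustrates that idea in stdlib Python. This is a hypothetical minimal scheme for illustration only, not the aggregation method used in the paper (real pipelines often align transcriptions token-by-token, e.g. ROVER-style, before voting).

```python
from collections import Counter

def majority_transcription(transcriptions):
    """Return the transcription string produced by the most workers.

    Deliberately simple sketch: each worker's full transcription is
    normalised (whitespace-trimmed, lower-cased) and the most frequent
    variant wins. Ties break by first occurrence, per Counter semantics.
    This is an assumed aggregation step, not the paper's own method.
    """
    counts = Counter(t.strip().lower() for t in transcriptions)
    best, _ = counts.most_common(1)[0]
    return best

# Three crowdworkers transcribe the same recording; two agree.
workers = [
    "we discussed the quarterly report",
    "we discussed the quarterly report ",
    "we discussed a quarterly report",
]
print(majority_transcription(workers))
# prints "we discussed the quarterly report"
```

In practice, whole-string voting discards partial agreement between transcriptions, which is why alignment-based merging is usually preferred for speech data.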
