Altruistic Crowdsourcing for Arabic Speech Corpus Annotation

Abstract Crowdsourcing is an emerging collaborative approach that can be used for effective annotation of linguistic resources. There are several crowdsourcing genres: paid-for, games with a purpose, and altruistic (volunteer-based) approaches. In this paper, we investigate the use of altruistic crowdsourcing for speech corpus annotation by narrating our experience of validating a semi-automatic task for dialect annotation of Kalam’DZ, a corpus dedicated to Algerian Arabic dialectal varieties. We start by describing the whole process of designing an altruistic crowdsourcing project. Using the unpaid CrowdCrafting platform, we performed experiments on a sample of 10% of the Kalam’DZ corpus, totaling more than 10 hours of speech from 1,012 speakers. The crowdsourcing job is evaluated against a gold-standard annotation produced by experts, which confirms a high inter-annotator agreement of 81%. Our results confirm that altruistic crowdsourcing is an effective approach for speech dialect annotation. In addition, we present a set of best practices for altruistic crowdsourcing for corpus annotation.
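The abstract does not specify which agreement metric underlies the reported 81%. As a minimal sketch of how crowd labels are typically validated against an expert gold standard, the snippet below computes raw percent agreement and chance-corrected Cohen's kappa over per-segment dialect labels; the label names and the example data are hypothetical, not taken from Kalam’DZ.

```python
from collections import Counter

def percent_agreement(gold, crowd):
    """Fraction of segments where the crowd label matches the gold label."""
    assert len(gold) == len(crowd) and gold
    matches = sum(1 for g, c in zip(gold, crowd) if g == c)
    return matches / len(gold)

def cohens_kappa(gold, crowd):
    """Chance-corrected agreement between two label sequences."""
    n = len(gold)
    p_o = percent_agreement(gold, crowd)
    gold_counts, crowd_counts = Counter(gold), Counter(crowd)
    # Expected agreement if both annotators labeled independently
    # according to their marginal label distributions.
    p_e = sum(gold_counts[l] * crowd_counts[l] for l in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dialect labels for ten audio segments
# (label inventory invented for illustration only).
gold  = ["ALG", "ALG", "ORN", "CST", "ALG", "ORN", "CST", "ALG", "ORN", "ALG"]
crowd = ["ALG", "ORN", "ORN", "CST", "ALG", "ORN", "ALG", "ALG", "ORN", "ALG"]

print(f"agreement = {percent_agreement(gold, crowd):.0%}")  # prints "agreement = 80%"
print(f"kappa     = {cohens_kappa(gold, crowd):.3f}")
```

Percent agreement is simple to report but inflated by chance matches when one dialect dominates the corpus; kappa corrects for that, which is why both are commonly reported side by side in annotation studies.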
