A Crowdsourcing-based Approach for Speech Corpus Transcription: Case of Arabic Algerian Dialects

In this paper, we describe a corpus annotation project that uses crowdsourcing to produce an orthographic transcription of the KALAM’DZ corpus (Bougrine et al., 2017c), a speech corpus dedicated to Arabic Algerian dialectal varieties. We turn to crowdsourcing to avoid the time and cost of solutions that involve experts. Since Arabic dialects have no standard orthography, we established guidelines that help the crowd produce more normalized transcriptions. We performed experiments on a 10% sample of the KALAM’DZ corpus, totaling 8.75 hours. The quality of the output transcriptions is controlled in three stages: pre-qualification of the crowd, online filtering, and in-lab validation and revision. A baseline resource, consisting of 5% of the targeted dataset transcribed by well-trained transcribers, is used to evaluate the first two stages. Our results confirm that crowdsourcing is an effective approach for transcribing speech in under-resourced dialects. Before validation by the well-trained transcribers, the accuracy of the transcriptions reached 74.38. In addition, we present a set of best practices for crowdsourcing speech corpus transcription.

[1]  Paul Boersma,et al.  Speak and unSpeak with PRAAT , 2002 .

[2]  Karima Meftouh,et al.  An Algerian dialect: Study and Resources , 2016, International Journal of Advanced Computer Science and Applications.

[3]  Samantha Wray,et al.  Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription , 2015, ANLP@ACL.

[4]  Mark Liberman,et al.  Transcriber: a free tool for segmenting, labeling and transcribing speech , 1998, LREC.

[5]  James R. Glass,et al.  The MGB-2 challenge: Arabic multi-dialect broadcast media recognition , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[6]  Mohamed Embarki,et al.  Les dialectes arabes modernes : état et nouvelles perspectives pour la classification géo-sociologique , 2008 .

[7]  K. Almeman,et al.  Multi dialect Arabic speech parallel corpora , 2013, 2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA).

[8]  M. Maamouri,et al.  Dialectal Arabic Telephone Speech Corpus: Principles, Tool design, and Transcription Conventions , 2004 .

[9]  Mohamed Abdelmageed Mansour,et al.  The Absence of Arabic Corpus Linguistics: A Call for Creating an Arabic National Corpus , 2013 .

[10]  Kalina Bontcheva,et al.  Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines , 2014, LREC.

[11]  Djelloul Ziadi,et al.  Hierarchical Classification for Spoken Arabic Dialect Identification using Prosody: Case of Algerian Dialects , 2017, ArXiv.

[12]  Fayez A. Alhargan,et al.  Saudi accented Arabic voice bank , 2008, ExLing.

[13]  Philippe Blache,et al.  Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation , 2015, Res. Comput. Sci..

[14]  Lamia Hadrich Belguith,et al.  Orthographic Transcription for Spoken Tunisian Arabic , 2013, CICLing.

[15]  The Encyclopaedia of Islam, New Edition , 2000 .

[16]  Stephan Vogel,et al.  Speech recognition challenge in the wild: Arabic MGB-3 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[17]  Nizar Habash,et al.  Conventional Orthography for Dialectal Arabic , 2012, LREC.

[18]  Ahmed Abdelali,et al.  Altruistic Crowdsourcing for Arabic Speech Corpus Annotation , 2017, ACLING.

[19]  Dominique Caubet,et al.  Questionnaire de dialectologie du Maghreb (d'après les travaux de W. Marçais, M. Cohen, G.S. Colin, J. Cantineau, D. Cohen, Ph. Marçais, S. Lévy, etc.) , 2000 .

[20]  Soumia Bougrine,et al.  Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties , 2017, WANLP@EACL.

[21]  Sameer Khurana,et al.  QCRI advanced transcription system (QATS) for the Arabic Multi-Dialect Broadcast media recognition: MGB-2 challenge , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[22]  Khalid Choukri,et al.  Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus , 2004, LREC.

[23]  Michael Vitale,et al.  The Wisdom of Crowds , 2015, Cell.

[24]  Nizar Habash,et al.  A Conventional Orthography for Algerian Arabic , 2015, ANLP@ACL.