Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation

Corpora are an important resource for natural language processing (NLP). Dialectal Arabic corpora are currently limited, particularly for Tunisian Arabic. Since the revolution, spoken Tunisian Arabic has become increasingly present in interviews, news, and debate programs, and language technologies are increasingly used for many spoken languages (e.g., Siri) [6]. Work on speech technologies for Tunisian Arabic therefore requires a large amount of well-designed spoken corpora. This paper presents the “STAC” corpus (Spoken Tunisian Arabic Corpus) of spontaneous Tunisian Arabic speech. We describe the method used to collect and transcribe the corpus. We then detail the stages carried out to enrich it with the linguistic and speech annotations that make it more useful for many NLP applications.

[1]  Kemal Oflazer,et al.  YouDACC: the Youtube Dialectal Arabic Comment Corpus , 2014, LREC.

[2]  Nizar Habash,et al.  Conventional Orthography for Dialectal Arabic , 2012, LREC.

[3]  Wolfgang Nejdl,et al.  Music Mood and Theme Classification - a Hybrid Approach , 2009, ISMIR.

[4]  Lamia Hadrich Belguith,et al.  Morphological Analysis of Tunisian Dialect , 2013, IJCNLP.

[5]  Tanja Schultz,et al.  Towards language portability in statistical speech translation , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Lamia Hadrich Belguith,et al.  Discriminative Framework for Spoken Tunisian Dialect Understanding , 2013, SLSP.

[7]  Kareem Darwish,et al.  Using Twitter to Collect a Multi-Dialectal Corpus of Arabic , 2014, ANLP@EMNLP.

[8]  Lamia Hadrich Belguith,et al.  Orthographic Transcription for Spoken Tunisian Arabic , 2013, CICLing.

[9]  Gérald Purnelle,et al.  Normalizing speech transcriptions for Natural Language Processing , 2009 .

[10]  Kevin Duh,et al.  Lexicon Acquisition for Dialectal Arabic Using Transductive Learning , 2006, EMNLP.

[11]  Kristin Precoda,et al.  Iraqcomm: a next generation translation system , 2007, INTERSPEECH.

[12]  Roxane Bertrand,et al.  Orthographic Transcription: which enrichment is required for phonetization? , 2012, LREC.

[13]  Slim Abdennadher,et al.  Modern standard Arabic based multilingual approach for dialectal Arabic speech recognition , 2009, 2009 Eighth International Symposium on Natural Language Processing.

[14]  Nizar Habash,et al.  Building a Corpus for Palestinian Arabic: a Preliminary Study , 2014, ANLP@EMNLP.

[15]  Samira Moukrim Fsahy Morphosyntaxe et sémantique du "présent " : une étude contrastive à partir de corpus oraux : arabe marocain, berbère tamazight et français (ESLO/LCO) , 2010 .

[16]  Nizar Habash,et al.  50th Annual Meeting of the Association for Computational Linguistics Proceedings of the Conference Volume 2: Short Papers , 2012 .

[17]  Nizar Habash,et al.  On Arabic Transliteration , 2007 .

[18]  Fethi Bougares,et al.  Phonetic tool for the Tunisian Arabic , 2014, SLTU.

[19]  Lamia Hadrich Belguith,et al.  Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora , 2013, IJCNLP.

[20]  Elizabeth Shriberg,et al.  Preliminaries to a Theory of Speech Disfluencies , 1994 .

[21]  Fayez A. Alhargan,et al.  Saudi accented Arabic voice bank , 2008, ExLing.

[22]  S. Lawson,et al.  Codeswitching in Tunisia: Attitudinal and behavioural dimensions , 2000 .

[23]  Roxana Girju,et al.  YADAC: Yet another Dialectal Arabic Corpus , 2012, LREC.

[24]  K. Almeman,et al.  Multi dialect Arabic speech parallel corpora , 2013, 2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA).

[25]  M. Maamouri,et al.  Dialectal Arabic Telephone Speech Corpus: Principles, Tool Design, and Transcription Conventions , 2004 .

[26]  Nizar Habash,et al.  A Conventional Orthography for Tunisian Arabic , 2014, LREC.

[27]  James F. Allen,et al.  Detecting and Correcting Speech Repairs , 1994, ACL.