Arap-Tweet: A Large Multi-Dialect Twitter Corpus for Gender, Age and Language Variety Identification

In this paper, we present Arap-Tweet, a large-scale, multi-dialectal corpus of tweets from 11 regions and 16 countries in the Arab world, representing the major Arabic dialectal varieties. To build the corpus, we collected data from Twitter and gave a team of experienced annotators guidelines for annotating it for age category, gender, and dialectal variety. During data collection, we based our search on distinctive keywords specific to the different Arabic dialects, and we validated the location using the Twitter API. We report on the data collection and annotation efforts and on some issues encountered during these phases. We then present the results of the evaluation performed to ensure the consistency of the annotation. The corpus will enrich the limited set of available language resources for Arabic and should be a valuable enabler for developing author profiling tools and NLP tools for Arabic.
