Paper Information - Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning

Identifying the language of a text or speech input is the first step in any language-dependent natural language processing pipeline. The task is called Automatic Language Identification (ALI). The field has been studied since the early 1960s, and various methods have been applied to many standard languages. Standard ALI methods require training datasets and use character-based or word-based n-gram models. However, social media and new technologies have contributed to the rise of informal and minority languages on the Web, and state-of-the-art automatic language identifiers fail to identify many of them properly. Romanized Arabic (RA) and Romanized Berber (RB) are two such informal, under-resourced languages. The goal of this paper is twofold: to detect RA and RB as separate languages at the document level, and to distinguish between them, since they coexist in North Africa. We treat the task as a classification problem and solve it with supervised machine learning. For both languages, character-based 5-grams combined with additional lexicons perform best, with F-scores of 99.75% for RB and 97.77% for RA.
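
To make the character 5-gram idea concrete, here is a minimal sketch of document-level classification. It is not the authors' implementation. The abstract does not name the classifier, so the multinomial Naive Bayes model is an assumption, as are the names `char_ngrams` and `CharNgramNaiveBayes`. The training strings are a handful of illustrative toy phrases, not data from the paper, and the additional lexicon features mentioned in the abstract are left out.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts, with add-one smoothing.

    Assumption: the classifier choice is illustrative only.
    """

    def __init__(self, n=5):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training documents
        self.vocab = set()                        # all n-grams seen during training

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        grams = char_ngrams(doc, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of every n-gram in the document.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set: "RA" = Romanized Arabic, "RB" = Romanized Berber.
train_docs = [
    "salam kifach rak labas",
    "wach rak khoya hamdoulah",
    "ana bikhir hamdoulah",
    "azul fellawen amek tettilid",
    "tanemmirt a gma azul",
    "amek tettilid a gma",
]
train_labels = ["RA", "RA", "RA", "RB", "RB", "RB"]

model = CharNgramNaiveBayes(n=5).fit(train_docs, train_labels)
print(model.predict("azul fellawen"))
print(model.predict("kifach rak"))
```

Padding each document with spaces lets the 5-grams capture word beginnings and endings, which are often the most distinctive cues in short, informally spelled text. A realistic system would train on a much larger labeled corpus and report precision, recall, and F-score on held-out documents.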
