Arabic natural language processing: An overview

Arabic is recognised as the fourth most used language on the Internet. It has three main varieties: (1) Classical Arabic (CA), (2) Modern Standard Arabic (MSA), and (3) Arabic Dialect (AD). MSA and AD can be written either in Arabic script or in Roman script (Arabizi), that is, Arabic written with Latin letters, numerals and punctuation. Because of the complexity of the language and the number of challenges it poses for NLP, many surveys have been conducted to synthesise the work done on Arabic. However, these surveys focus principally on two varieties of Arabic (MSA and AD, written in Arabic script only), and they are somewhat dated (no such survey has appeared since 2015), so they do not cover recent resources and tools. To bridge this gap, we propose a survey of 90 recent research papers (74% of which were published after 2015). Our study presents and classifies the work done on the three varieties of Arabic, covering both Arabic script and Arabizi, and links each work to its publicly available resources where these exist.
