Deep learning for Arabic NLP: A survey

Abstract Recent advances in deep learning (DL) have led to breakthroughs in many fields, including computer vision, natural language processing (NLP) and speech processing. Many DL-based approaches have been shown to produce state-of-the-art results on tasks that are of great importance to online social networks (OSN) and social computing, such as sentiment analysis (SA) and pharmacovigilance. NLP tasks are becoming very prominent in OSN, and DL offers researchers and practitioners exciting new directions for addressing them. In this paper, we survey the published work on using DL techniques for NLP. We focus on the Arabic language because of its importance, the scarcity of resources for it and the challenges associated with working on it. We observe that, despite the vast adoption of social networks in the Arab world, DL has yet to receive the attention it deserves from the Arabic NLP (ANLP) community compared with the attention it receives for other languages. Most of the early work on using DL for ANLP focused on OCR-related problems, while more recent work is more diverse, with increasing interest in applying DL to SA, machine translation, diacritization and other tasks. This survey is intended as a guide for the young and growing ANLP community, to help bridge the large gap between the ANLP literature and the much richer and more mature English NLP literature.
