Arabic Natural Language Processing from Software Engineering to Complex Pipeline

Arabic Natural Language Processing (ANLP) has developed considerably over the last decade. Many ANLP tools are now available, such as morphological analyzers and syntactic parsers. These tools differ widely in the programming languages they are written in, the inputs and outputs they handle, and the internal and external representations of their results. This diversity is mainly due to the lack of models and standards governing their implementation, and it hinders both interoperability between the tools and their reuse in new, more advanced projects. In this article, we propose APIs and models for three types of tools, namely stemmers, morphological analyzers, and syntactic parsers, using the SAFAR platform. Our proposal is a step toward standardizing all aspects shared by tools of the same type. We also review the issue of interoperability between these tools. Finally, we discuss pipeline processes.
