The Power of Language Music: Arabic Lemmatization through Patterns

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. Roots provide the consonantal building blocks, and patterns provide the syllabic vocalic moulds; roots denote abstract semantic classes, and patterns realize these classes in specific instances. Both are therefore indispensable for understanding the derivational, morphological and, to some extent, cognitive aspects of the Arabic language. In this paper we perform lemmatization, a high-level lexical processing task, without relying on a lookup dictionary. We use a hybrid approach that combines a machine learning classifier, which predicts the lemma pattern for a given stem, with mapping rules that convert stems to their respective lemmas using the vocalization defined by the pattern.
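The two-stage idea in the abstract (predict a lemma pattern, then apply mapping rules) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the Buckwalter-style transliteration, the digit-slot pattern notation (`1`, `2`, `3` for root consonants), the two toy patterns, and the hypothetical `predict_pattern` function, which is a hand-written stand-in for the trained classifier.

```python
from typing import Dict, Tuple


def predict_pattern(stem: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the classifier.

    Returns (stem_template, lemma_pattern). A real system would use a
    learned model over stem features; these two rules are toy assumptions.
    """
    if len(stem) == 4 and stem[1] == "A":
        return "1A23", "1A2i3"  # e.g. kAtb -> kAtib
    if len(stem) == 3:
        return "123", "1a2a3"   # e.g. ktb -> katab
    raise ValueError(f"no toy pattern for stem {stem!r}")


def extract_radicals(stem: str, template: str) -> Dict[str, str]:
    """Align the stem with its template and collect the slot letters."""
    if len(stem) != len(template):
        raise ValueError("stem and template differ in length")
    radicals: Dict[str, str] = {}
    for ch, slot in zip(stem, template):
        if slot.isdigit():
            radicals[slot] = ch      # root consonant fills a numbered slot
        elif ch != slot:
            raise ValueError("stem does not match template")
    return radicals


def apply_pattern(radicals: Dict[str, str], pattern: str) -> str:
    """Mapping rule: fill the lemma pattern's slots with the radicals."""
    return "".join(radicals[c] if c.isdigit() else c for c in pattern)


def lemmatize(stem: str) -> str:
    template, pattern = predict_pattern(stem)
    return apply_pattern(extract_radicals(stem, template), pattern)


if __name__ == "__main__":
    for s in ("ktb", "kAtb"):
        print(s, "->", lemmatize(s))
```

No dictionary lookup is involved: the vocalized form comes entirely from the predicted pattern, so the quality of the output in a real system would depend on the classifier and on how completely the mapping rules cover the pattern inventory.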
