A Comparative Survey on Arabic Stemming: Approaches and Challenges

Arabic, as a Semitic language, has a rich and complex morphology that differs radically from that of European and East Asian languages. Its derivational system is based on roots, which are inflected to form words using a relatively large set of affixes, e.g., antefixes, prefixes, and suffixes. Stemming is the process of reducing all the inflected forms of a word to a common canonical form. It is one of the earliest and most important phases in natural language processing, machine translation, and information retrieval tasks. A number of Arabic stemmers have been proposed, based on approaches that include light stemming, morphological analysis, statistical stemming, N-grams, and parallel corpora (collections). Motivated by the results reported in the literature, this paper attempts to review exhaustively the current achievements in stemming Arabic text and discusses a variety of algorithms. Its main contribution is to provide a better understanding of existing approaches, in the hope of building an error-free and effective Arabic stemmer in the near future.
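To make the idea of mapping inflected forms to a common canonical form concrete, the following is a minimal sketch of affix stripping in the spirit of light stemming. It is not an algorithm taken from this paper: the function name `light_stem`, the short affix lists, and the `min_len` threshold are illustrative assumptions, and a practical stemmer would need far more careful rules.

```python
# Illustrative affix lists (assumptions, not an authoritative inventory).
# Longer prefixes come first so that "وال" is tried before "ال".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "يه", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix from an Arabic word.

    An affix is removed only if at least `min_len` characters remain,
    a crude guard against over-stemming short words.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    # Three inflected forms of "teacher" that should share one stem.
    for form in ("المعلمون", "المعلمين", "والمعلمات"):
        print(form, "->", light_stem(form))
```

Under these assumptions, the three inflected forms in the usage example are all reduced to the same stem, which is the behaviour an information retrieval system relies on when matching query terms against documents. Irregular forms such as broken plurals cannot be handled by suffix and prefix removal alone, which is one reason root-based and statistical approaches are also studied.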
