A novel robust Arabic light stemmer

Abstract Stemming is the process of reducing a word to its root or stem, and it is a crucial pre-processing step for most natural language processing and information retrieval tasks. For Arabic, however, an effective stemming algorithm is difficult to design, because Arabic morphology differs from that of many other languages. Although several algorithms in the literature address Arabic stemming, most of them handle only a limited number of words, confuse original letters with affixes, and usually rely on a dictionary of words or patterns. To address these limitations, we propose the design and implementation of a novel Arabic light stemmer based on new rules for stripping prefixes, suffixes and infixes. To our knowledge, it is the first work to deal with Arabic infixes with regard to their irregular rules. The empirical evaluation was conducted on a new Arabic dataset (called ARASTEM), which was designed and collected from several Arabic discussion forums containing dialectal Arabic and modern pseudo-Arabic languages. We present a comparative investigation of our new stemmer and other existing stemmers using Paice’s parameters, namely the Under Stemming Index (UI), the Over Stemming Index (OI) and the Stemming Weight (SW). Results show that the proposed Arabic light stemmer maintains consistently high performance and outperforms several existing light stemmers.
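The abstract names Paice’s three parameters without defining them. The following is a minimal sketch of how UI, OI and SW can be computed, following our reading of Paice’s error-counting definitions: words are manually grouped by concept, UI measures the share of same-group word pairs that the stemmer fails to merge, OI measures the share of different-group word pairs that it wrongly merges, and SW = OI / UI. The function name `paice_indices`, the toy English word groups and the toy stemmers are illustrative assumptions; they are not taken from the paper or from ARASTEM.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI, SW) for a stemmer over manually grouped words.

    groups: list of lists; each inner list holds the distinct words of one
            concept group (assumed: no word appears in two groups).
    stem:   callable mapping a word to its stem (hypothetical stemmer).
    """
    total_words = sum(len(g) for g in groups)
    gdmt = gdnt = gumt = gwmt = 0.0

    # Under-stemming side: pairs inside a group that should share a stem.
    for g in groups:
        n = len(g)
        gdmt += 0.5 * n * (n - 1)            # desired merge total
        gdnt += 0.5 * n * (total_words - n)  # desired non-merge total
        per_stem = Counter(stem(w) for w in g)
        # pairs of words in this group that ended up with different stems
        gumt += 0.5 * sum(u * (n - u) for u in per_stem.values())

    # Over-stemming side: pairs from different groups that share a stem.
    by_stem = defaultdict(Counter)
    for gi, g in enumerate(groups):
        for w in g:
            by_stem[stem(w)][gi] += 1
    for per_group in by_stem.values():
        n_s = sum(per_group.values())
        gwmt += 0.5 * sum(v * (n_s - v) for v in per_group.values())

    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    sw = oi / ui if ui else None  # SW is undefined when UI is zero
    return ui, oi, sw


# Toy data (assumption): two concept groups of English words.
toy_groups = [["connect", "connected", "connection"], ["create", "creates"]]

no_stemming = paice_indices(toy_groups, lambda w: w)        # merges nothing
first_letter = paice_indices(toy_groups, lambda w: w[:1])   # merges everything
```

On this toy data, a stemmer that changes nothing is maximally under-stemming (UI = 1, OI = 0), while one that keeps only the first letter is maximally over-stemming (UI = 0, OI = 1). A light stemmer is generally expected to sit toward the low-OI end of this range.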
