Enhancing Arabic stemming process using resources and benchmarking tools

Many approaches have been proposed for developing Arabic light stemmers. These stemmers are often used in application-oriented projects, especially in the development of information retrieval (IR) systems. However, Arabic light stemming, the process of stripping off a set of prefixes and/or suffixes, is a blind task that suffers from problems such as incorrect affix removal, vocalization ambiguity, and single-solution output. Moreover, authors often claim that their stemmers achieve high strength and accuracy. In most cases, however, these stemmers are black boxes: it is possible to access neither their source code, to verify their validity, nor the evaluation corpora that were used to support the claimed accuracy. Since these stemmers are very important for researchers, comparing and evaluating them is essential to help choose a stemmer for a given project. In this paper, we propose a new Arabic stemmer that addresses the above-mentioned drawbacks. In addition, we propose an automatic approach for evaluating and comparing Arabic stemmers that takes into account metrics related to the accuracy of the results as well as the execution time of the stemmers.
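To make the notions of affix stripping, incorrect removal, and accuracy/time evaluation concrete, the following is a minimal sketch in Python. It is not the stemmer or the benchmarking tool proposed in the paper: the affix lists, the `MIN_STEM_LEN` threshold, the tiny gold list, and the function names `light_stem` and `evaluate` are all assumptions chosen for illustration.

```python
import time

# Illustrative affix lists (assumptions, not the lists used in the paper).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 2  # hypothetical lower bound on what must remain after stripping


def light_stem(word):
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def evaluate(stemmer, gold):
    """Return (accuracy, elapsed seconds) of a stemmer on (word, stem) pairs."""
    start = time.perf_counter()
    predictions = [stemmer(word) for word, _ in gold]
    elapsed = time.perf_counter() - start
    correct = sum(pred == stem for pred, (_, stem) in zip(predictions, gold))
    return correct / len(gold), elapsed


# Tiny hand-made gold list (assumed for the example only).
GOLD = [
    ("الكتاب", "كتاب"),    # definite article stripped correctly
    ("والمعلمون", "معلم"),  # conjunction + article and plural suffix stripped
    ("ولد", "ولد"),        # the initial letter belongs to the word itself
]

accuracy, seconds = evaluate(light_stem, GOLD)
print(f"accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

The third word shows the "blind" nature of the task: the stripper removes the initial و even though it is part of the word, so the sketch gets two of the three gold stems right. A real benchmark would use a much larger annotated corpus and richer metrics than plain accuracy.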
