Impact of Stemmer on Arabic Text Retrieval

Stemming is the process of reducing inflected words to their stem or root form. Arabic is one of the most highly inflected languages in the world. Stemming improves retrieval performance by reducing word variants and increasing the similarity between related words. An Arabic Information Retrieval (AIR) system can therefore use stemming algorithms to retrieve a greater number of documents related to the user's query. The aim of this paper is to evaluate the impact of three Arabic stemmers on AIR performance: the Information Science Research Institute (ISRI) stemmer, the Educated Text Stemmer (ETS), which performs morphological and syntax-based lemmatization, and the Light10 stemmer. The Linguistic Data Consortium (LDC) Arabic Newswire data set was used as the benchmark dataset. Of the three stemmers, Light10 achieved the best performance in terms of mean average precision.
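For readers unfamiliar with the evaluation measure, mean average precision (MAP) can be sketched as follows. This is a minimal illustration of the standard definition, not the evaluation code used in the paper. The function names, document identifiers, and toy rankings are hypothetical, and each ranked list is assumed to contain no duplicate documents.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Precision is taken at the rank of each relevant document retrieved,
    and the sum is divided by the total number of relevant documents,
    so relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    scores = [
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two queries with ranked results and judgments.
toy_runs = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d6"],
}
toy_qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d6"},
}

toy_map = mean_average_precision(toy_runs, toy_qrels)
print(f"MAP = {toy_map:.4f}")
```

Comparing stemmers under this measure would mean indexing the same collection once per stemmer, running the same queries against each index, and computing MAP for each resulting set of rankings.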
