Contextual text categorization: an improved stemming algorithm to increase the quality of categorization in Arabic text

One way to reduce the size of the term vocabulary in Arabic text categorization is to replace the different variants (forms) of a word with their common root. This process is called root-based stemming. Root extraction is more difficult in Arabic than in other languages because Arabic is a very rich language with a complex morphology. Many algorithms have been proposed in this field. Some are based on morphological rules and grammatical patterns; they are quite difficult to implement and require deep linguistic knowledge. Others are statistical; they are simpler and rely only on calculations. In this paper we propose an improved stemming algorithm, based on root extraction and the n-gram technique, that returns the stems of Arabic words without using any morphological rules or grammatical patterns.
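To make the statistical family of approaches concrete, the following is a minimal sketch of n-gram-based root matching. It is not the algorithm proposed in this paper, whose details the abstract does not give. It assumes a small, known list of candidate roots, undiacritized input, ordered (possibly non-adjacent) letter pairs as the bigrams, and the Dice coefficient as the similarity measure; the function names `letter_pairs`, `dice` and `extract_root` are hypothetical.

```python
from itertools import combinations


def letter_pairs(word):
    """Return the set of ordered letter pairs (non-contiguous bigrams) of a word.

    Assumption: pairs keep the left-to-right order of letters in the word,
    so root letters separated by infixes can still be matched.
    """
    return {a + b for a, b in combinations(word, 2)}


def dice(pairs_a, pairs_b):
    """Dice coefficient between two sets of bigrams."""
    if not pairs_a or not pairs_b:
        return 0.0
    return 2 * len(pairs_a & pairs_b) / (len(pairs_a) + len(pairs_b))


def extract_root(word, candidate_roots):
    """Return the candidate root most similar to `word`, or None if nothing matches.

    No morphological rules or patterns are used, only bigram overlap.
    """
    word_pairs = letter_pairs(word)
    best_root, best_score = None, 0.0
    for root in candidate_roots:
        score = dice(word_pairs, letter_pairs(root))
        if score > best_score:
            best_root, best_score = root, score
    return best_root


# Illustrative candidate roots (assumed, not taken from the paper).
ROOTS = ["كتب", "درس", "علم"]

# Several surface forms collapse onto a smaller vocabulary of roots.
words = ["مكتوب", "كاتب", "مدرسة", "معلوم"]
vocabulary = {extract_root(w, ROOTS) for w in words}
print(len(words), "word forms ->", len(vocabulary), "roots")
```

In a categorization pipeline, each token would be replaced by its matched root before term weighting, which is how the vocabulary size shrinks. Words with no overlap with any candidate root return `None` and could be kept unchanged.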
