Rational Kernels for Arabic Root Extraction and Text Classification

In this paper, we address the problems of Arabic Text Classification and root extraction using transducers and rational kernels. We introduce a new root extraction approach based on Arabic patterns (Pattern Based Stemmer). Transducers model these patterns, and roots are extracted without relying on any dictionary. Because roots are extracted with transducers, documents are themselves transformed into finite-state transducers. This document representation allows us to explore rational kernels as a framework for Arabic Text Classification. Root extraction experiments conducted on three word collections yield an accuracy of 75.6%. Classification experiments are conducted on the Saudi Press Agency dataset, and N-gram kernels are tested with different values of N. Accuracy and F1 reach 90.79% and 62.93%, respectively. These results show that our approach is promising when compared with other approaches, especially in terms of accuracy and F1.
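The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it replaces transducers with plain string matching and counting, and it assumes Buckwalter-style transliteration in which the letters f, E and l mark the three root slots of a pattern. The function names, the sample patterns and the fallback behaviour are all hypothetical. The kernel shown is the usual count-based form of an N-gram kernel, which sums, over every N-gram shared by two documents, the product of its counts in each.

```python
from collections import Counter

# Assumption: patterns are written in Buckwalter-style transliteration,
# where 'f', 'E' and 'l' stand for the three root slots.
SLOTS = "fEl"


def match_pattern(word, pattern):
    """Return the letters filling the root slots if word fits pattern, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in SLOTS:
            root.append(w)      # slot position: keep the word's letter
        elif w != p:
            return None         # literal position: letters must agree
    return "".join(root)


def extract_root(word, patterns):
    """Try each pattern in order; fall back to the word itself (hypothetical choice)."""
    for pattern in patterns:
        root = match_pattern(word, pattern)
        if root is not None:
            return root
    return word


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_kernel(doc_a, doc_b, n):
    """Sum of count products over the n-grams common to both documents."""
    counts_a, counts_b = ngram_counts(doc_a, n), ngram_counts(doc_b, n)
    return sum(c * counts_b[g] for g, c in counts_a.items())


# Illustrative patterns only, not the paper's pattern list.
patterns = ["mfEwl", "fAEl"]
doc_a = [extract_root(w, patterns) for w in ["mktwb", "drs", "kAtb"]]
doc_b = [extract_root(w, patterns) for w in ["kAtb", "drs"]]
similarity = ngram_kernel(doc_a, doc_b, 2)
```

Here each document is reduced to a sequence of roots before the kernel is applied, which mirrors the pipeline the abstract describes. A value computed this way could be supplied to an SVM as a precomputed kernel matrix entry, though how the paper does this is not stated in the abstract.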
