The Effect of Preprocessing on Arabic Document Categorization

Preprocessing is one of the main components of a conventional document categorization (DC) framework. This paper highlights the effect of preprocessing tasks on the performance of an Arabic DC system. Three classification techniques are used in the study: naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Choosing an appropriate combination of preprocessing tasks significantly improves categorization accuracy, although the best combination depends on the feature size and the classification technique. The findings show that SVM outperformed both KNN and NB, achieving a micro-F1 of 96.74% with the combination of normalization and stemming as preprocessing tasks.
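To make the two preprocessing tasks named above concrete, the following is a minimal sketch of Arabic normalization followed by a simple light stemmer. It is illustrative only and is not the paper's implementation: the function names (`normalize`, `light_stem`, `preprocess`), the chosen normalization rules, the prefix list, and the minimum stem length of three letters are all assumptions made for this example.

```python
import re

# Assumed rule: strip short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Assumed rule: map alef with madda/hamza (U+0622, U+0623, U+0625) to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

# Hypothetical prefix list (wal-, bal-, kal-, fal-, al-, w-), longest first.
_PREFIXES = (
    "\u0648\u0627\u0644",
    "\u0628\u0627\u0644",
    "\u0643\u0627\u0644",
    "\u0641\u0627\u0644",
    "\u0627\u0644",
    "\u0648",
)


def normalize(text: str) -> str:
    """Collapse common orthographic variants to a single form."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> yaa
    text = text.replace("\u0629", "\u0647")  # taa marbuta -> haa
    return text


def light_stem(token: str) -> str:
    """Strip at most one prefix, keeping at least three letters (assumed threshold)."""
    for prefix in _PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            return token[len(prefix):]
    return token


def preprocess(text: str) -> list[str]:
    """Normalize, split on whitespace, then light-stem each token."""
    return [light_stem(tok) for tok in normalize(text).split()]
```

The resulting token lists would then be turned into feature vectors and passed to a classifier such as NB, KNN, or SVM. Prefix-only stripping is a deliberate simplification; a practical light stemmer would typically also handle suffixes.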
