The impact of indexing approaches on Arabic text classification

This paper investigates the impact of three indexing approaches (full-word, stem, and root) on Arabic text classification. A naïve Bayes classifier is used to construct multinomial classification models, which are evaluated using stratified k-fold cross-validation with k ranging from 2 to 10. The study uses a corpus of 1000 normalized Arabic documents. One experiment shows significant accuracy improvements in most k-folds when the full-word form is used. Further experiments show that the classifier achieves its highest accuracy with eight folds, that is, a 7/8–1/8 train–test ratio, regardless of the indexing approach used. Overall, the classifier achieves a maximum micro-average accuracy of 99.36% with either the full-word form or the stem form. This indicates that the stem is the better choice for classifying Arabic text: it reaches the highest level of accuracy while making the corpus dataset smaller, which improves both processing time and storage utilization.
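The evaluation procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one smoothing, the round-robin fold assignment, and the toy English tokens are all assumptions made for illustration. In the study's setting, each token list would hold the full words, stems, or roots of one normalized Arabic document, depending on the indexing approach under test. Accuracy is pooled over all held-out documents, which is one common reading of "micro-average".

```python
import math
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Split document indices into k folds, round-robin within each class,
    so every fold keeps roughly the class proportions of the whole corpus."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

def train_nb(docs, labels):
    """Collect the counts needed by a multinomial naive Bayes model."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, y in zip(docs, labels):
        term_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n in class_counts.items():
        total = sum(term_counts[y].values()) + len(vocab)
        score = math.log(n / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[y][t] + 1) / total)
        if score > best_score:
            best, best_score = y, score
    return best

def cross_validate(docs, labels, k):
    """Pooled accuracy over all k held-out folds."""
    correct = 0
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train = [i for i in range(len(docs)) if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)

# Toy corpus (hypothetical): each document is a list of index terms.
docs = [
    ["goal", "match", "team"], ["team", "player", "goal"],
    ["match", "player", "win"], ["team", "win", "match"],
    ["market", "bank", "price"], ["bank", "loan", "market"],
    ["price", "loan", "stock"], ["stock", "market", "bank"],
]
labels = ["sport"] * 4 + ["economy"] * 4

accuracy = cross_validate(docs, labels, k=4)
print(f"pooled accuracy over 4 folds: {accuracy:.2%}")
```

With k folds, each model is trained on (k-1)/k of the documents and tested on the remaining 1/k, so k = 8 corresponds to the 7/8–1/8 train–test ratio mentioned above. Comparing indexing approaches would amount to running the same procedure on three tokenizations of the same corpus.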
