Feature sub-set selection metrics for Arabic text classification

Feature sub-set selection (FSS) is an important step in building effective text classification (TC) systems. This paper presents an empirical comparison of seventeen traditional FSS metrics for TC tasks. The study is restricted to the support vector machine (SVM) classifier and to Arabic articles. The evaluation uses a corpus of 7842 documents independently classified into ten categories. The experimental results are reported in terms of macro-averaged precision, macro-averaged recall, and macro-averaged F1 measures. The results show that the Chi-square and Fallout FSS metrics work best for Arabic TC tasks.
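As a rough illustration of the quantities named in the abstract, the following Python sketch scores a term against a category with the standard Chi-square statistic and computes macro-averaged precision, recall, and F1. It is not taken from the paper: the function names (`chi_square`, `macro_scores`), the contingency-count arguments, and the choice to define macro-averaged F1 as the mean of the per-category F1 values are all assumptions made for this example, and the paper's own definitions may differ.

```python
from typing import List, Sequence, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for one category (illustrative helper).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def macro_scores(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over all categories.

    Assumption: macro-F1 is the unweighted mean of per-category F1 values.
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k
```

In an FSS setting, a score such as `chi_square` would typically be computed for every term and category, and only the highest-scoring terms kept as features before training the classifier. Because macro-averaging gives every category equal weight, small categories influence the reported figures as much as large ones.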
