Weirdness Coefficient as a Feature Selection Method for Arabic Special Domain Text Classification

Given the importance of organizing and managing the rapidly growing body of Arabic electronic content, this study introduces the Weirdness Coefficient (W) as a new feature selection method for Arabic special domain text classification. The proposed method was used to classify a dataset comprising five Islamic topics using Naive Bayes (NB) and k-nearest neighbor (k-NN) classifiers and three representation schemas. The results were compared with those of a well-known feature selection method, Chi-squared. In addition to being simple to compute, the Weirdness Coefficient showed promising classification accuracy.
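
The abstract does not define W. As a rough illustration only, the sketch below assumes the definition commonly used in terminology extraction: the ratio of a term's relative frequency in a special-domain corpus to its relative frequency in a general (or contrasting) corpus. The function names, the `smoothing` parameter, and the toy tokens are hypothetical and are not taken from the study, whose exact formulation may differ.

```python
from collections import Counter


def weirdness(special_tokens, general_tokens, smoothing=1.0):
    """Score each term in the special corpus by its weirdness.

    Assumed definition: (f_special / N_special) / (f_general / N_general).
    `smoothing` is added to the general-corpus count so that terms absent
    from the general corpus do not cause a division by zero (an assumption
    made for this sketch).
    """
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special = len(special_tokens)
    n_general = len(general_tokens)

    scores = {}
    for term, f_special in special.items():
        rel_special = f_special / n_special
        rel_general = (general.get(term, 0) + smoothing) / n_general
        scores[term] = rel_special / rel_general
    return scores


def top_k_features(scores, k):
    """Keep the k terms with the highest weirdness as features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy, transliterated tokens for illustration only.
special_corpus = ["zakat", "zakat", "the", "prayer"]
general_corpus = ["the", "the", "the", "market", "prayer", "news"]

scores = weirdness(special_corpus, general_corpus)
selected = top_k_features(scores, k=2)
```

Under this reading, terms that are frequent in the domain corpus but rare elsewhere receive high scores, and feature selection amounts to keeping the top-ranked terms before training the NB or k-NN classifier.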
