Feature Selection and Reduction for Persian Text Classification

With the rapid growth of the World Wide Web and the increasing availability of electronic documents, automatic text classification has become a general and important machine learning problem in the text mining domain. In text classification, feature selection is used to reduce the size of the feature vector and to improve classifier performance. This paper improves Dominance, an existing feature selection criterion, and proposes Extended Dominance (E-Dominance) as a new criterion. E-Dominance compares favorably with common feature selection methods based on document frequency (DF), information gain (IG), entropy, χ², and Dominance on a collection of XML documents from Hamshahri2, which is commonly used in Persian text classification. The comparative study confirms the effectiveness of the proposed feature selection criterion derived from Dominance.
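
To make the idea of a feature selection criterion concrete, the following is a minimal sketch of two of the baseline criteria named above, document frequency (DF) and χ², in their standard textbook form. It does not implement Dominance or E-Dominance, whose definitions are not given here. The function names, the toy English documents, and the category labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def chi_square(docs, term, category):
    """Chi-square statistic of a (term, category) pair from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_top_k(docs, category, k):
    """Keep the k terms with the highest chi-square score (ties broken by name)."""
    vocab = document_frequency(docs)
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, t, category), t))
    return ranked[:k]


# Hypothetical toy corpus: (tokens, category label) pairs.
docs = [
    (["football", "match", "goal"], "sport"),
    (["football", "league", "goal"], "sport"),
    (["election", "vote", "match"], "politics"),
    (["election", "parliament", "vote"], "politics"),
]

print(document_frequency(docs)["football"])
print(chi_square(docs, "football", "sport"))
print(select_top_k(docs, "sport", 4))
```

In this toy corpus, "match" appears equally often in both categories and therefore receives a χ² score of zero, so it is dropped first. Terms that occur only outside the target category score as high as terms that occur only inside it, because χ² measures dependence in either direction.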
