Comparison of text feature selection policies and using an adaptive framework

Text categorization is the task of automatically assigning unlabeled text documents to predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, feature selection is often used to reduce the dimensionality. In this paper, we evaluate and compare the feature selection policies used in text categorization by employing some of the popular feature selection metrics. For the experiments, we use datasets that vary in size, complexity, and skewness. We use a support vector machine as the classifier and tf-idf weighting for the terms. In addition to evaluating the policies, we propose new feature selection metrics that achieve high success rates, especially with a small number of keywords. These metrics are two-sided local metrics and are based on the difference between the distribution of a term in the documents belonging to a class and its distribution in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting a different number of terms for each class, and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.
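The abstract does not give the formulas of the proposed metrics, so the following Python sketch is only an illustration of the general idea under stated assumptions. It scores a term for a class by the absolute difference between its document-frequency rate inside the class and outside it, and then keeps a different number of top-ranked terms per class. The function names `doc_frequency_difference` and `select_keywords`, the absolute-difference score, and the hand-supplied per-class budgets are all hypothetical choices made for this example, not the metrics or the adaptive procedure from the paper.

```python
from collections import defaultdict


def doc_frequency_difference(docs, labels, target):
    """Score each term by |P(term | class) - P(term | not class)|.

    Illustrative stand-in for a two-sided local metric; the paper's
    actual metrics are not specified in the abstract.
    """
    pos_total = sum(1 for y in labels if y == target)
    neg_total = len(labels) - pos_total
    pos_df = defaultdict(int)  # document frequency inside the class
    neg_df = defaultdict(int)  # document frequency outside the class
    for tokens, y in zip(docs, labels):
        counts = pos_df if y == target else neg_df
        for term in set(tokens):  # count each term once per document
            counts[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        p_pos = pos_df[term] / pos_total if pos_total else 0.0
        p_neg = neg_df[term] / neg_total if neg_total else 0.0
        # The absolute value makes the score two-sided: terms typical
        # of the class and terms typical of its complement both rank high.
        scores[term] = abs(p_pos - p_neg)
    return scores


def select_keywords(docs, labels, budget_per_class):
    """Keep a different number of top-scoring terms for each class.

    budget_per_class is an assumed, hand-supplied mapping from class
    label to the number of keywords to keep for that class.
    """
    selected = {}
    for target, k in budget_per_class.items():
        scores = doc_frequency_difference(docs, labels, target)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[target] = ranked[:k]
    return selected


# Toy corpus: already tokenized documents with a skewed class distribution.
docs = [
    ["stock", "market", "rises"],
    ["stock", "prices", "fall"],
    ["market", "stock", "news"],
    ["team", "wins", "match"],
]
labels = ["finance", "finance", "finance", "sports"]

scores = doc_frequency_difference(docs, labels, "finance")
keywords = select_keywords(docs, labels, {"finance": 1, "sports": 2})
```

Because the score is two-sided, a term that never occurs in a class can rank as highly for that class as a term that occurs in all of its documents. In this sketch the per-class budgets are fixed by the caller; how the number of terms per class should be chosen is exactly what an adaptive scheme would have to decide.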
