Integrating feature and instance selection for text classification

Instance selection and feature selection are two orthogonal methods for reducing the amount and complexity of data. Feature selection aims at reducing the number of redundant features in a dataset, whereas instance selection aims at reducing the number of instances. So far, these two methods have mostly been considered in isolation. In this paper, we present a new algorithm, which we call FIS (Feature and Instance Selection), that targets both problems simultaneously in the context of text classification. Our experiments on the Reuters and 20-Newsgroups datasets show that FIS considerably reduces both the number of features and the number of instances. The accuracy of a range of classifiers, including Naïve Bayes, TAN and LB, improves considerably on the FIS-preprocessed datasets, matching and exceeding that of Support Vector Machines, currently considered one of the best text classification methods. In all cases the results are much better than those obtained with Mutual Information based feature selection. The training and classification speed of all classifiers is also greatly improved.
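
To make the two reduction axes concrete, the sketch below shows a generic pipeline that first applies Mutual Information based feature selection (the baseline the abstract compares against) and then a placeholder instance-reduction step, evaluated with a Naïve Bayes classifier on 20-Newsgroups. This is not the FIS algorithm itself, which is defined in the paper body; the random subsampling step and all parameter values (vocabulary size, number of selected features, subsample ratio) are illustrative assumptions only.

```python
# Generic illustration of feature selection followed by instance selection.
# NOT the FIS algorithm: feature selection uses the Mutual Information
# baseline mentioned in the abstract, and instance selection is replaced
# by a naive random subsample purely to show where it fits in the pipeline.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Bag-of-words representation of the documents (vocabulary size is an assumption).
vectorizer = CountVectorizer(stop_words="english", max_features=20000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Feature selection: keep the 2,000 terms with the highest mutual information
# with the class label (the MI baseline; k is an illustrative choice).
selector = SelectKBest(mutual_info_classif, k=2000)
X_train_fs = selector.fit_transform(X_train, train.target)
X_test_fs = selector.transform(X_test)

# Instance selection: a random 50% subsample stands in for a real method;
# FIS chooses instances far more carefully than this.
rng = np.random.default_rng(0)
keep = rng.choice(X_train_fs.shape[0], size=X_train_fs.shape[0] // 2, replace=False)

clf = MultinomialNB().fit(X_train_fs[keep], train.target[keep])
print("accuracy:", accuracy_score(test.target, clf.predict(X_test_fs)))
```

The point of the sketch is that the two reductions compose: the feature selector shrinks the columns of the document-term matrix, the instance selector shrinks its rows, and any downstream classifier trains on the smaller matrix.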
