Paper information - FISA: Feature-Based Instance Selection for Imbalanced Text Classification

Support Vector Machine (SVM) classifiers are widely used in text classification tasks, which often involve imbalanced training data. In this paper, we specifically address the case where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of the negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
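The abstract does not spell out how FISA chooses which negative documents to keep, so the following is only a minimal sketch of the general idea of feature-based selection of negative instances, not the authors' algorithm. The function name `select_negatives`, the document-frequency score, the `top_k` parameter, and the rule of keeping the negatives that contain the most positive-indicative terms are all assumptions made for illustration. The default `keep_ratio` of 0.35 simply echoes the 35% figure quoted above.

```python
from collections import Counter


def select_negatives(positives, negatives, keep_ratio=0.35, top_k=20):
    """Keep a fraction of the negative documents, chosen by feature overlap.

    Hypothetical illustration only: documents are whitespace-tokenised
    strings, and both input lists are assumed to be non-empty.
    """
    # Document frequency of every term within each class.
    pos_df = Counter(t for doc in positives for t in set(doc.split()))
    neg_df = Counter(t for doc in negatives for t in set(doc.split()))

    # Assumed feature score: how much more often a term occurs in
    # positive documents than in negative ones (relative frequencies).
    def score(term):
        return pos_df[term] / len(positives) - neg_df[term] / len(negatives)

    # The top_k highest-scoring terms act as the selected features.
    features = set(sorted(pos_df, key=score, reverse=True)[:top_k])

    # Rank negatives by how many selected features they contain; ties
    # keep their original order because Python's sort is stable.
    ranked = sorted(
        negatives,
        key=lambda doc: len(features & set(doc.split())),
        reverse=True,
    )

    # Always keep at least one negative document.
    n_keep = max(1, round(keep_ratio * len(negatives)))
    return ranked[:n_keep]
```

The reduced negative set, together with all positive documents, would then be passed to whichever SVM trainer is in use. Keeping the negatives that look most like positives is one plausible design choice, since those documents are the ones most likely to lie near the decision boundary; other selection rules fit the same outline.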
