On strategies for imbalanced text classification using SVM: A comparative study

Many real-world text classification tasks involve imbalanced training examples. The strategies proposed to address imbalanced classification (e.g., resampling, instance weighting), however, have not been systematically evaluated in the text domain. In this paper, we conduct a comparative study of the effectiveness of these strategies for imbalanced text classification using the Support Vector Machine (SVM) classifier. We focus on SVM because of the good classification accuracy reported for it in many text classification tasks. We propose a taxonomy that organizes the proposed strategies according to the training and test phases of a text classification task. Based on this taxonomy, we survey the methods proposed to address imbalanced classification. Among them, 10 commonly used methods were evaluated in our experiments on three benchmark datasets: Reuters-21578, 20-Newsgroups, and WebKB. Using the area under the Precision-Recall Curve as the performance measure, our experimental results showed that the best decision surface was often learned by the standard SVM, without any of the proposed strategies. We believe this negative finding will benefit both researchers and application developers in the area by directing more attention to thresholding strategies.
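
To make the performance measure concrete, the following is a minimal sketch, not the procedure used in the study. It assumes the area under the Precision-Recall Curve is estimated by the step-wise sum of precision times the recall increment (often called average precision); the abstract does not say which estimator was used. The function names, scores, and labels are hypothetical and chosen only for illustration. The sketch also shows why thresholding is a separate concern: shifting the decision threshold changes precision and recall at the operating point but leaves the ranking, and hence the area, unchanged.

```python
def precision_recall_points(scores, labels):
    """Sweep the threshold down through each distinct score and
    return the resulting (recall, precision) pairs."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every example tied at the current score together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(scores, labels):
    """Step-wise area under the PR curve (assumed estimator)."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in precision_recall_points(scores, labels):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    total_pos = sum(labels)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return precision, tp / total_pos


# Hypothetical SVM decision values for six documents, two of them positive.
scores = [2.1, 0.4, -0.2, -0.7, -1.3, -1.8]
labels = [1, 0, 1, 0, 0, 0]

ap = average_precision(scores, labels)
default_point = precision_recall_at(scores, labels, 0.0)    # default SVM threshold
relaxed_point = precision_recall_at(scores, labels, -0.5)   # relaxed threshold
shifted_ap = average_precision([s + 1.0 for s in scores], labels)
```

On this toy ranking, relaxing the threshold from 0 to -0.5 recovers the second positive document, while the area computed from the ranking is the same before and after shifting all scores by a constant.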
