Adaptive Resampling with Active Learning

This paper proposes a novel algorithm, Virtual Instances Resampling Technique Using Active Learning (VIRTUAL), for the class imbalance problem in Support Vector Machine (SVM) learning. In supervised learning, the prediction performance of classification algorithms deteriorates when the training set is imbalanced. The class imbalance problem occurs when at least one of the classes is represented by substantially fewer instances than the others in the training set. Various real-world classification tasks, such as medical diagnosis and text categorization, suffer from this phenomenon. VIRTUAL combines oversampling and active learning into an adaptive technique for resampling the minority class instances. Unlike traditional resampling methods, which require preprocessing of the data, VIRTUAL generates virtual instances for the minority class support vectors during the training process and therefore removes the need for an extra preprocessing stage. Our empirical results show that VIRTUAL outperforms other competitive oversampling techniques and an active learning strategy in terms of prediction capability. In addition, VIRTUAL is more efficient in generating new instances and has a shorter training time than the other oversampling techniques, owing to its adaptive nature and its decision capability in creating virtual instances.
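As a rough illustration of the central idea, creating a virtual instance for a minority class support vector, the following sketch may help. It is not the authors' implementation: the abstract does not say how virtual instances are constructed, so the sketch assumes SMOTE-style linear interpolation toward a nearby minority instance. The function names, the parameter `k`, and the toy data are all hypothetical, and the active learning loop that would decide when to call this step is left out.

```python
import math
import random


def nearest_minority_neighbors(point, minority, k):
    """Return the k minority instances closest to `point`, excluding `point` itself."""
    others = [m for m in minority if m != point]
    others.sort(key=lambda m: math.dist(point, m))
    return others[:k]


def make_virtual_instance(support_vector, minority, k=3, rng=random):
    """Create one virtual minority instance near a minority support vector.

    ASSUMPTION: SMOTE-style interpolation between the support vector and a
    randomly chosen one of its k nearest minority neighbors. The abstract
    does not specify the construction rule.
    """
    neighbors = nearest_minority_neighbors(support_vector, minority, k)
    neighbor = rng.choice(neighbors)
    t = rng.random()  # position along the segment, in [0, 1)
    return tuple(s + t * (n - s) for s, n in zip(support_vector, neighbor))


# Toy minority class (hypothetical data); the first point plays the role
# of a minority instance that has just become a support vector.
minority_instances = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
virtual = make_virtual_instance(
    minority_instances[0], minority_instances, k=2, rng=random.Random(0)
)
```

In an adaptive scheme of the kind the abstract describes, a step like this would presumably run only when a newly selected minority instance becomes a support vector during training, rather than over the whole minority class in a preprocessing pass.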
