Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach

We present a framework that addresses the imbalanced-data problem through semi-supervised learning. Specifically, we transform the original supervised problem into a semi-supervised one and then use a semi-supervised learning method to identify the most relevant instances, from which we build a well-defined training set. Extensive experimental results demonstrate that, across three different classifiers, the proposed framework significantly outperforms all other sampling algorithms in 67% of the cases and ranks second best in the remaining 33%.
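The abstract does not spell out the procedure, so the following is a minimal sketch of one plausible instantiation, not the authors' exact algorithm. It assumes scikit-learn's LabelSpreading (Zhou et al.'s label-propagation method) as the semi-supervised learner; the step of hiding half the majority labels, the 0.8 confidence threshold, and the agreement-based selection rule are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact method): hide majority-class
# labels to turn the supervised problem into a semi-supervised one,
# propagate labels back, and keep only the majority instances whose
# propagated label confidently agrees with the original, yielding a
# smaller, better-defined training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Imbalanced toy data: ~90% majority (class 0), ~10% minority (class 1).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

maj = np.where(y == 0)[0]
mino = np.where(y == 1)[0]

# Hide a random half of the majority labels (-1 marks "unlabeled").
rng = np.random.default_rng(0)
hidden = rng.choice(maj, size=len(maj) // 2, replace=False)
y_semi = y.copy()
y_semi[hidden] = -1

# Propagate labels over the data manifold.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
proba = model.predict_proba(X)[:, 0]  # confidence in the majority label

# Keep a hidden majority instance only if the propagated label confidently
# agrees with its original one; these act as the "most relevant" instances.
keep = hidden[proba[hidden] > 0.8]
selected = np.concatenate([np.setdiff1d(maj, hidden), keep, mino])
print(f"training set: {len(selected)} of {len(y)} instances "
      f"({len(mino)} minority, {len(selected) - len(mino)} majority)")
```

In this sketch the semi-supervised learner acts as a filter on the majority class, making it effectively an under-sampling scheme; the same skeleton could instead rank unlabeled candidates for the minority class if over-sampling were the goal.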
