Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach

We present a framework that addresses the imbalanced-data problem through semi-supervised learning. Specifically, we transform the original supervised problem into a semi-supervised one and then use a semi-supervised learning method to identify the most relevant instances, from which we build a well-defined training set. Extensive experimental results demonstrate that, across three different classifiers, the proposed framework significantly outperforms all other sampling algorithms in 67% of the cases and ranks second best in the remaining 33%.
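The abstract does not spell out the procedure, so the following is a minimal sketch of one plausible instantiation, not the authors' exact algorithm. It assumes scikit-learn's LabelSpreading (Zhou et al.'s label-propagation method) as the semi-supervised learner; the step of hiding half the majority labels, the 0.8 confidence threshold, and the agreement-based selection rule are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact method): hide majority-class
# labels to turn the supervised problem into a semi-supervised one,
# propagate labels back, and keep only the majority instances whose
# propagated label confidently agrees with the original, yielding a
# smaller, better-defined training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Imbalanced toy data: ~90% majority (class 0), ~10% minority (class 1).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

maj = np.where(y == 0)[0]
mino = np.where(y == 1)[0]

# Hide a random half of the majority labels (-1 marks "unlabeled").
rng = np.random.default_rng(0)
hidden = rng.choice(maj, size=len(maj) // 2, replace=False)
y_semi = y.copy()
y_semi[hidden] = -1

# Propagate labels over the data manifold.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
proba = model.predict_proba(X)[:, 0]  # confidence in the majority label

# Keep a hidden majority instance only if the propagated label confidently
# agrees with its original one; these act as the "most relevant" instances.
keep = hidden[proba[hidden] > 0.8]
selected = np.concatenate([np.setdiff1d(maj, hidden), keep, mino])
print(f"training set: {len(selected)} of {len(y)} instances "
      f"({len(mino)} minority, {len(selected) - len(mino)} majority)")
```

In this sketch the semi-supervised learner acts as a filter on the majority class, making it effectively an under-sampling scheme; the same skeleton could instead rank unlabeled candidates for the minority class if over-sampling were the goal.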
