CORE: core-based synthetic minority over-sampling and borderline majority under-sampling technique

Class imbalance learning has recently drawn considerable attention among researchers. In this area, a rare class is the class of primary interest from the aim of classification. Unfortunately, traditional machine learning algorithms fail to detect this class because a huge majority class overwhelms a tiny minority class. In this paper, we propose a new technique called CORE to handle the class imbalance problem. The objective of CORE is to strengthen the core of a minority class and weaken the risk of misclassified minority instances nearby the borderline of a majority class. These core and borderline regions are defined by the applicability of a safe level. As a result, a minority class is more crowed and dominant. The experiment shows that CORE can significantly improve the predictive performance of a minority class when its dataset is imbalance.

[1]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[2]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[3]  Tom Fawcett,et al.  Combining Data Mining and Machine Learning for Effective User Profiling , 1996, KDD.

[4]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[5]  Hitoshi Iba,et al.  Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[6]  Gregg D. Wilensky,et al.  Neural Network Studies , 1993 .

[7]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[8]  Ahmet Sertbas,et al.  Selection of vocal features for Parkinson's Disease diagnosis , 2012, Int. J. Data Min. Bioinform..

[9]  Igor V. Tetko,et al.  Neural network studies, 1. Comparison of overfitting and overtraining , 1995, J. Chem. Inf. Comput. Sci..

[10]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[11]  Robert C. Holte,et al.  Severe Class Imbalance: Why Better Algorithms Aren't the Answer , 2005, ECML.

[12]  Chumphol Bunkhumpornpat,et al.  DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique , 2011, Applied Intelligence.

[13]  Taesung Park,et al.  Unbalanced sample size effect on the genome-wide population differentiation studies , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW).

[14]  Xiaohua Hu,et al.  MAPLSC: A novel multi-class classifier for medical diagnosis , 2011, Int. J. Data Min. Bioinform..

[15]  Ian Witten,et al.  Data Mining , 2000 .

[16]  Nathalie Japkowicz,et al.  The Class Imbalance Problem: Significance and Strategies , 2000 .

[17]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[18]  Evangelos E. Milios,et al.  Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets , 2001, AISTATS.

[19]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[20]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[21]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[22]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[23]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[24]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[25]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[26]  Stan Matwin,et al.  Learning When Negative Examples Abound , 1997, ECML.

[27]  Eibe Frank,et al.  Naive Bayes for Text Classification with Unbalanced Classes , 2006, PKDD.

[28]  Somnuk Phon-Amnuaisuk,et al.  A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection , 2010, Applied Intelligence.

[29]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[30]  Nathalie Japkowicz,et al.  A Novelty Detection Approach to Classification , 1995, IJCAI.

[31]  Chumphol Bunkhumpornpat,et al.  MUTE: Majority under-sampling technique , 2011, 2011 8th International Conference on Information, Communications & Signal Processing.

[32]  Gustavo E. A. P. A. Batista,et al.  Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior , 2004, MICAI.

[33]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[34]  Fredric C. Gey,et al.  The relationship between recall and precision , 1994 .