Diversity and separable metrics in over-sampling technique for imbalanced data classification

The imbalanced data problem in classification is a significant research area and has attracted a lot of attention in recent years. Techniques that rebalance the class distribution, such as over-sampling or under-sampling, are the most common approaches to this problem. This paper presents a new method, called Diversity and Separable Metrics in Over-Sampling Technique (DSMOTE), for handling imbalanced learning problems. The main idea of DSMOTE is to use a diversity measure and a separable measure, both of which have a positive impact on the minority class. The diversity measure reduces overfitting, while the separable measure decreases the risk of generating new samples near decision boundaries, where samples are hard to learn. The proposed method improves learning accuracy in three stages: (1) removing abnormal samples from the minority class, (2) selecting the top three minority-class samples according to the desired criteria, and (3) generating a new sample from the selected samples. The experiments are conducted on five real-world datasets obtained from Iran University of Medical Science and on six UCI datasets. Three different classifiers, four resampling algorithms and six performance evaluation measures are used to evaluate the proposed method. The reported results indicate that the proposed approach performs better than, or at least comparably to, state-of-the-art methods.
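The three stages can be illustrated with a minimal Python sketch. The abstract does not define the abnormality rule, the diversity and separable measures, or the generation step, so everything concrete below is an assumption made for illustration only and is not the authors' DSMOTE: a minority sample is treated as abnormal when all of its k nearest neighbours belong to the majority class, diversity is the mean distance to the other minority samples, separability is the distance to the nearest majority sample, the two are summed with equal weight, and a new sample is a random convex combination of the three top-scoring samples in a random candidate pool. The names `pairwise_dist`, `three_stage_oversample`, `k` and `pool` are hypothetical.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def three_stage_oversample(X_min, X_maj, n_new, k=3, pool=4, seed=0):
    """Illustrative three-stage over-sampler (assumed rules, not DSMOTE itself)."""
    rng = np.random.default_rng(seed)
    n_min = len(X_min)

    # Stage 1 (assumed rule): a minority sample is "abnormal" when none of
    # its k nearest neighbours belongs to the minority class.
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < n_min
    d_all = pairwise_dist(X_min, X_all)
    d_all[np.arange(n_min), np.arange(n_min)] = np.inf  # ignore self-matches
    nearest = np.argsort(d_all, axis=1)[:, :k]
    keep = is_min[nearest].any(axis=1)
    X_clean = X_min[keep]
    if len(X_clean) < 3:
        raise ValueError("need at least three minority samples after cleaning")

    # Stage 2 scores (assumed measures): diversity is the mean distance to the
    # other minority samples, separability is the distance to the nearest
    # majority sample; the score is their unweighted sum.
    diversity = pairwise_dist(X_clean, X_clean).sum(axis=1) / (len(X_clean) - 1)
    separability = pairwise_dist(X_clean, X_maj).min(axis=1)
    score = diversity + separability

    pool = max(3, min(pool, len(X_clean)))
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        # Stage 2: keep the three best-scoring samples of a random pool.
        cand = rng.choice(len(X_clean), size=pool, replace=False)
        top3 = cand[np.argsort(score[cand])[-3:]]
        # Stage 3 (assumed rule): random convex combination of the three.
        weights = rng.dirichlet(np.ones(3))
        synthetic[i] = weights @ X_clean[top3]
    return synthetic, keep
```

Because each synthetic point is a convex combination of three retained minority samples, it always lies inside the triangle they span. Drawing a random candidate pool on each iteration is only one way to avoid selecting the same three samples every time.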
