CGMOS: Certainty Guided Minority OverSampling

Handling imbalanced datasets is a challenging problem that if not treated correctly results in reduced classification performance. Imbalanced datasets are commonly handled using minority oversampling, whereas the SMOTE algorithm is a successful oversampling algorithm with numerous extensions. SMOTE extensions do not have a theoretical guarantee during training to work better than SMOTE and in many instances their performance is data dependent. In this paper we propose a novel extension to the SMOTE algorithm with a theoretical guarantee for improved classification performance. The proposed approach considers the classification performance of both the majority and minority classes. In the proposed approach CGMOS (Certainty Guided Minority OverSampling) new data points are added by considering certainty changes in the dataset. The paper provides a proof that the proposed algorithm is guaranteed to work better than SMOTE for training data. Further, experimental results on 30 real-world datasets show that CGMOS works better than existing algorithms when using 6 different classifiers.

[1]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[2]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[3]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[4]  Eric R. Ziegel,et al.  Probability and Statistics for Engineering and the Sciences , 2004, Technometrics.

[5]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[6]  Herna L. Viktor,et al.  Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach , 2004, SKDD.

[7]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[8]  Manali Sharma,et al.  Most-Surely vs. Least-Surely Uncertain , 2013, 2013 IEEE 13th International Conference on Data Mining.

[9]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[10]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[11]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[12]  D. W. Zimmerman Teacher’s Corner: A Note on Interpretation of the Paired-Samples t Test , 1997 .

[13]  L. Davis,et al.  Background and foreground modeling using nonparametric kernel density estimation for visual surveillance , 2002, Proc. IEEE.

[14]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[15]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, Sixth International Conference on Data Mining (ICDM'06).

[16]  Steve McKillup,et al.  Statistics Explained: An Introductory Guide for Life Scientists , 2006 .

[17]  Rob J. Hyndman,et al.  A Bayesian approach to bandwidth selection for multivariate kernel density estimation , 2006, Comput. Stat. Data Anal..

[18]  Kazuyuki Murase,et al.  A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning , 2011, ICONIP.

[19]  Gregory W. Corder,et al.  Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach , 2009 .

[20]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .

[21]  Mehryar Mohri,et al.  Confidence Intervals for the Area Under the ROC Curve , 2004, NIPS.

[22]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[23]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[24]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[25]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[26]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[27]  Stan Szpakowicz,et al.  Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation , 2006, Australian Conference on Artificial Intelligence.

[28]  James Parker,et al.  on Knowledge and Data Engineering, , 1990 .

[29]  Foster J. Provost,et al.  Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , 2003, J. Artif. Intell. Res..

[30]  Zhi-Hua Zhou,et al.  Exploratory Undersampling for Class-Imbalance Learning , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[31]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[32]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).