Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE

Classification of imbalanced datasets is a challenging task for standard algorithms. Although many methods exist to address this problem in different ways, generating artificial data for the minority class is a more general approach compared to algorithmic modifications. SMOTE algorithm and its variations generate synthetic samples along a line segment that joins minority class instances. In this paper we propose Geometric SMOTE (G-SMOTE) as a generalization of the SMOTE data generation mechanism. G-SMOTE generates synthetic samples in a geometric region of the input space, around each selected minority instance. While in the basic configuration this region is a hyper-sphere, G-SMOTE allows its deformation to a hyper-spheroid and finally to a line segment, emulating, in the last case, the SMOTE mechanism. The performance of G-SMOTE is compared against multiple standard oversampling algorithms. We present empirical results that show a significant improvement in the quality of the generated data when G-SMOTE is used as an oversampling algorithm.

[1]  Nitesh V. Chawla,et al.  Data Mining for Imbalanced Datasets: An Overview , 2005, The Data Mining and Knowledge Discovery Handbook.

[2]  Fernando Nogueira,et al.  Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning , 2016, J. Mach. Learn. Res..

[3]  Stephen Kwek,et al.  Applying Support Vector Machines to Imbalanced Datasets , 2004, ECML.

[4]  Xin Li,et al.  Protein classification with imbalanced data , 2007, Proteins.

[5]  Herna L. Viktor,et al.  Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach , 2004, SKDD.

[6]  Xin Yao,et al.  Resampling-Based Ensemble Methods for Online Class Imbalance Learning , 2015, IEEE Transactions on Knowledge and Data Engineering.

[7]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[8]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[9]  Pravin M. Vaidya,et al.  AnO(n logn) algorithm for the all-nearest-neighbors Problem , 1989, Discret. Comput. Geom..

[10]  Fernando Bação,et al.  Effective data generation for imbalanced learning using conditional generative adversarial networks , 2018, Expert Syst. Appl..

[11]  David A. Cieslak,et al.  Combating imbalance in network intrusion datasets , 2006, 2006 IEEE International Conference on Granular Computing.

[12]  Kai Ming Ting,et al.  An Instance-weighting Method to Induce Cost-sensitive Trees , 2001 .

[13]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[14]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[15]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[16]  Anirban DasGupta,et al.  Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics , 2011 .

[17]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[18]  David A. Cieslak,et al.  Start Globally, Optimize Locally, Predict Globally: Improving Performance on Imbalanced Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[19]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[20]  Iman Nekooeimehr,et al.  Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets , 2016, Expert Syst. Appl..

[21]  Bo Tang,et al.  KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning , 2015, 2015 IEEE Congress on Evolutionary Computation (CEC).

[22]  Chumphol Bunkhumpornpat,et al.  DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique , 2011, Applied Intelligence.

[23]  MonardMaria Carolina,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004 .

[24]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[25]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[26]  Mikel Galar,et al.  Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches , 2013, Knowl. Based Syst..

[27]  Gregory Asner,et al.  Tree Species Abundance Predictions in a Tropical Agricultural Landscape with a Supervised Classification Model and Imbalanced Data , 2016, Remote. Sens..

[28]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[29]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[30]  Rok Blagus,et al.  SMOTE for high-dimensional class-imbalanced data , 2013, BMC Bioinformatics.

[31]  Bart Baesens,et al.  New insights into churn prediction in the telecommunication sector: A profit driven data mining approach , 2012, Eur. J. Oper. Res..

[32]  S. Clearwater,et al.  A rule-learning program in high energy physics event classification , 1991 .

[33]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[34]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[35]  P. McCullagh,et al.  Generalized Linear Models , 1984 .

[36]  Dimitris Kanellopoulos,et al.  Handling imbalanced datasets: A review , 2006 .

[37]  Yuming Zhou,et al.  A novel ensemble method for classifying imbalanced data , 2015, Pattern Recognit..