Paper information - Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE


Classification of imbalanced datasets is a challenging task for standard algorithms. Although many methods address this problem in different ways, generating artificial data for the minority class is a more general approach than algorithmic modifications. The SMOTE algorithm and its variations generate synthetic samples along a line segment that joins minority class instances. In this paper, we propose Geometric SMOTE (G-SMOTE) as a generalization of the SMOTE data generation mechanism. G-SMOTE generates synthetic samples in a geometric region of the input space around each selected minority instance. In the basic configuration this region is a hyper-sphere, but G-SMOTE allows it to be deformed into a hyper-spheroid and, in the limit, into a line segment, in which case it emulates the SMOTE mechanism. The performance of G-SMOTE is compared against multiple standard oversampling algorithms. We present empirical results that show a significant improvement in the quality of the generated data when G-SMOTE is used as an oversampling algorithm.
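The geometric idea in the abstract can be illustrated with a minimal sketch. This is a simplified illustration under stated assumptions, not the authors' reference implementation: the function name `g_smote_sample`, the parameter names `truncation` and `deformation`, and the choice to pass the neighbor in directly are all hypothetical. A point is drawn uniformly in a unit hyper-ball, optionally reflected toward the neighbor, squeezed toward the center-to-neighbor axis, and then scaled and shifted to the selected minority instance.

```python
import math
import random


def g_smote_sample(center, neighbor, truncation=0.0, deformation=0.0, rng=random):
    """Draw one synthetic point around `center` (hypothetical helper).

    deformation=0 keeps the full hyper-sphere; deformation=1 collapses it
    onto the line through `center` and `neighbor`. With truncation=1 as
    well, samples fall on the segment from `center` to `neighbor`, which
    mimics the SMOTE mechanism.
    """
    dim = len(center)

    # Radius and unit direction from the center to the chosen neighbor.
    diff = [n - c for c, n in zip(center, neighbor)]
    radius = math.sqrt(sum(d * d for d in diff))
    if radius == 0.0:
        return list(center)
    unit = [d / radius for d in diff]

    # Uniform point in the unit hyper-ball: Gaussian direction,
    # radial coordinate u ** (1 / dim).
    gauss = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(g * g for g in gauss)) or 1.0
    scale = rng.random() ** (1.0 / dim) / norm
    point = [g * scale for g in gauss]

    # Truncation: reflect points that fall on the excluded side of the ball.
    proj = sum(p * u for p, u in zip(point, unit))
    if abs(truncation - proj) > 1.0:
        point = [p - 2.0 * proj * u for p, u in zip(point, unit)]
        proj = -proj

    # Deformation: shrink the component orthogonal to the axis.
    point = [
        proj * u + (1.0 - deformation) * (p - proj * u)
        for p, u in zip(point, unit)
    ]

    # Translation: scale by the radius and move to the center.
    return [c + radius * p for c, p in zip(center, point)]


if __name__ == "__main__":
    rng = random.Random(42)
    center, neighbor = [0.0, 0.0], [2.0, 0.0]
    print(g_smote_sample(center, neighbor, rng=rng))            # hyper-sphere
    print(g_smote_sample(center, neighbor, 1.0, 1.0, rng=rng))  # SMOTE-like segment
```

Intermediate values of `deformation` give a hyper-spheroid elongated along the center-to-neighbor axis, which corresponds to the middle case the abstract describes.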

Fernando Bação | Georgios Douzas
