A self‐adaptive synthetic over‐sampling technique for imbalanced classification

Traditionally, in supervised machine learning, a significant part of the available data (usually 50%‐80%) is used for training and the rest for validation. In many problems, however, the data are highly imbalanced across classes or do not cover the feasible data space well, which in turn creates problems in the validation and usage phases. In this paper, we propose a technique for synthesizing feasible and likely data to help balance the classes and to boost performance, both overall and per class as measured by the confusion matrix, especially for the minority classes. The idea, in a nutshell, is to synthesize data samples in the close vicinity of the actual data samples, specifically for the less represented (minority) classes. This also has implications for the so‐called fairness of machine learning. The method is generic and can be applied to different base algorithms, for example, support vector machines, k‐nearest neighbour classifiers, deep neural networks, rule‐based classifiers, decision trees, and so forth. The results demonstrate that (a) significantly more balanced (and fairer) classification results can be achieved and (b) the overall performance, as well as the performance per class measured by the confusion matrix, can be boosted. In addition, this approach can be very valuable when the amount of available labelled data is small, which is itself one of the problems of contemporary machine learning.
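
To make the core idea concrete, the following is a minimal sketch of vicinity-based over-sampling, not the method proposed in the paper. It assumes that "close vicinity" can be approximated by Gaussian perturbations of real minority samples, scaled by the per-feature spread of each class. The function name `oversample_minority` and the parameters `scale` and `seed` are hypothetical and introduced only for illustration.

```python
import numpy as np


def oversample_minority(X, y, scale=0.1, seed=0):
    """Balance classes by adding noisy copies of real samples (illustrative only).

    Assumption: a synthetic sample is a randomly chosen real sample of the
    same class plus Gaussian noise whose standard deviation is `scale` times
    the per-feature standard deviation of that class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)

    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()  # every class is grown to the size of the largest one

    X_parts, y_parts = [X], [y]
    for label, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0:
            continue  # the largest class needs no synthetic samples
        X_c = X[y == label]
        # Per-feature spread of this class; constant features fall back to 1.0.
        spread = X_c.std(axis=0)
        spread = np.where(spread > 0, spread, 1.0)
        # Pick real samples (with replacement) and perturb them slightly.
        anchors = X_c[rng.integers(0, count, size=n_new)]
        noise = rng.normal(0.0, scale, size=anchors.shape) * spread
        X_parts.append(anchors + noise)
        y_parts.append(np.full(n_new, label, dtype=y.dtype))

    return np.vstack(X_parts), np.concatenate(y_parts)


# Usage: a 90/10 imbalanced two-class toy problem.
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0, 1, (90, 2)), demo_rng.normal(5, 1, (10, 2))])
y_demo = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = oversample_minority(X_demo, y_demo)
print(np.bincount(y_bal))  # per-class counts after balancing
```

Because the balanced arrays have the same layout as the original ones, they can be passed to any base classifier unchanged, which is consistent with the claim that this kind of over-sampling is independent of the base algorithm. Only the training split should be augmented this way; validation data should remain real.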
