EUSC: A clustering-based surrogate model to accelerate evolutionary undersampling in imbalanced classification

Abstract Learning from imbalanced datasets is in high demand in real-world applications and poses a challenge for standard classifiers, which tend to be biased towards the classes with the majority of the examples. Undersampling approaches reduce the size of the majority class to balance the class distribution. Evolutionary approaches are prominent among them, treating undersampling as a binary optimisation problem that determines which examples are removed. However, their use is limited to small datasets due to the cost of fitness evaluation. This work proposes a two-stage clustering-based surrogate model that enables evolutionary undersampling to compute fitness values faster. The main novelty lies in the development of a surrogate model for binary optimisation that is based on the meaning of the solutions (phenotype) rather than on their binary representation (genotype). We conduct an evaluation on 44 imbalanced datasets, showing that, in comparison with the original evolutionary undersampling, we can save up to 83% of the runtime without significantly deteriorating the classification performance.
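The idea described above can be sketched in a minimal, illustrative form. This is not the authors' implementation: the names, the simple generational GA (the paper's evolutionary undersampling family typically uses CHC), the tiny k-means, the phenotype defined as per-cluster retention fractions of the majority class, and the similarity threshold are all assumptions made for the sketch. The surrogate reuses a cached fitness when a new chromosome's phenotype is close to an already-evaluated one, so expensive real evaluations (here, the geometric mean of per-class recall of a 1-NN classifier) are skipped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 200 majority (class 0) vs 20 minority (class 1) examples.
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.0, 1.0, size=(20, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 200 + [1] * 20)

def kmeans(data, k, iters=20):
    # Tiny k-means, standing in for any clustering algorithm.
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = data[labels == j].mean(axis=0)
    return labels

K = 5
maj_clusters = kmeans(X_maj, K)  # cluster only the majority class

def phenotype(chrom):
    # Phenotype: fraction of each majority-class cluster retained by the chromosome.
    return np.array([chrom[maj_clusters == j].mean() if (maj_clusters == j).any()
                     else 0.0 for j in range(K)])

def true_fitness(chrom):
    # Geometric mean of per-class recall for 1-NN trained on the reduced set.
    keep = np.concatenate([np.flatnonzero(chrom), np.arange(200, 220)])
    Xt, yt = X[keep], y[keep]
    d = ((X[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    d[keep, np.arange(len(keep))] = np.inf  # a retained point may not match itself
    pred = yt[np.argmin(d, axis=1)]
    rec = [np.mean(pred[y == c] == c) for c in (0, 1)]
    return np.sqrt(rec[0] * rec[1])

cache = []  # (phenotype, fitness) pairs from real evaluations
evals = 0   # counter of real (non-surrogate) evaluations

def fitness(chrom, threshold=0.05):
    global evals
    ph = phenotype(chrom)
    if cache:
        dists = [np.abs(ph - p).mean() for p, _ in cache]
        i = int(np.argmin(dists))
        if dists[i] < threshold:
            return cache[i][1]  # surrogate: reuse the nearest cached fitness
    evals += 1
    f = true_fitness(chrom)
    cache.append((ph, f))
    return f

# Simple generational GA: 1 = keep the majority example, 0 = remove it.
pop = (rng.random((20, 200)) < 0.5).astype(int)
for gen in range(30):
    fits = np.array([fitness(c) for c in pop])
    new = []
    for _ in range(20):
        # binary tournament selection, uniform crossover, bit-flip mutation
        a, b = rng.choice(20, 2), rng.choice(20, 2)
        p1 = pop[a[np.argmax(fits[a])]]
        p2 = pop[b[np.argmax(fits[b])]]
        child = np.where(rng.random(200) < 0.5, p1, p2)
        child = np.where(rng.random(200) < 0.01, 1 - child, child)
        new.append(child)
    pop = np.array(new)

best = max(pop, key=fitness)
print(f"real evaluations: {evals} of 620 fitness calls; best GM = {true_fitness(best):.3f}")
```

As the population converges, near-duplicate chromosomes produce near-identical phenotypes, so the cache-hit rate grows and a large share of the 620 fitness calls never pay the cost of a real evaluation, which mirrors the runtime savings the abstract reports.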
