A Hybrid Surrogate Model for Evolutionary Undersampling in Imbalanced Classification

Data preprocessing is a key stage in data mining that enables machine learning algorithms to extract meaningful insights. Many preprocessing problems, such as feature selection or instance selection, can be modelled as optimisation or search problems. Evolutionary algorithms have traditionally excelled at these tasks on moderately sized data, but their application to large datasets typically incurs very high computational costs. In this work, we propose a hybrid surrogate model for evolutionary undersampling in imbalanced classification problems. Such problems are characterised by a highly skewed class distribution, and evolutionary undersampling aims to balance the training data by selecting only the most relevant instances. The proposed technique combines a two-stage clustering-based surrogate method with a windowing approach to quickly approximate the fitness values of the chromosomes and accelerate the search. Experiments on 44 standard imbalanced datasets show that the proposed hybrid surrogate model greatly reduces the computational cost of the evolutionary algorithm without a considerable loss of performance.
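
To make the windowing idea concrete, the following is a minimal Python sketch and not the implementation evaluated in this work. It assumes one common reading of windowing: the training set is split into disjoint stratified windows, and a chromosome is scored on a single window rather than on the full data. Further assumptions, none of which are stated in the abstract: the chromosome is a binary mask over the majority-class instances only, all minority instances are always kept, labels are 0 (majority) and 1 (minority), and fitness is the geometric mean of the per-class recalls of a 1-NN classifier. The function names `make_windows`, `g_mean` and `windowed_fitness` are hypothetical, and the clustering-based surrogate stage is not shown.

```python
import numpy as np


def make_windows(y, n_windows, rng):
    """Split instance indices into disjoint windows, stratified by class."""
    y = np.asarray(y)
    windows = [[] for _ in range(n_windows)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for w, part in enumerate(np.array_split(idx, n_windows)):
            windows[w].extend(part.tolist())
    return [np.array(sorted(w), dtype=int) for w in windows]


def g_mean(y_true, y_pred):
    """Geometric mean of the recalls of the classes present in y_true."""
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in (0, 1) if np.any(y_true == c)]
    if not recalls:
        return 0.0
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


def windowed_fitness(chromosome, X, y, window):
    """Approximate fitness of a chromosome using one window only.

    chromosome: boolean mask over majority instances (assumed encoding).
    window:     indices of the instances used for evaluation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = np.asarray(chromosome, dtype=bool)
    selected = np.concatenate([minority, majority[keep]])

    # Squared Euclidean distances from window instances to selected ones.
    diff = X[window][:, None, :] - X[selected][None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # An instance must not be its own nearest neighbour.
    dist[window[:, None] == selected[None, :]] = np.inf

    y_pred = y[selected][dist.argmin(axis=1)]
    return g_mean(y[window], y_pred)


# Toy usage: 40 majority and 8 minority instances, 4 windows.
X_toy = np.array([[0.1 * i, 0.0] for i in range(40)]
                 + [[100 + 0.1 * i, 0.0] for i in range(8)])
y_toy = np.array([0] * 40 + [1] * 8)
toy_windows = make_windows(y_toy, 4, np.random.default_rng(0))
candidate = np.ones(40, dtype=bool)  # keep every majority instance
for generation, window in enumerate(toy_windows):
    print(generation, windowed_fitness(candidate, X_toy, y_toy, window))
```

In this sketch each evaluation computes distances only for the instances in one window, so with `n_windows` windows the cost per fitness evaluation drops roughly by that factor, at the price of a noisier fitness estimate.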
