Multi-objective evolution of oblique decision trees for imbalanced data binary classification

Abstract Imbalanced data classification is one of the most challenging problems in data mining. In this kind of problem, there are two types of classes: the majority class and the minority class. The former has a relatively high number of instances, while the latter contains far fewer. Because most traditional classifiers assume that data is evenly distributed across all classes, they may fail to recognize instances of the minority class. Several approaches have been proposed in the literature to handle the class imbalance issue, and the Oblique Decision Tree (ODT) is one of them. Nevertheless, most standard ODT construction algorithms use a greedy search process. Only very few works have addressed this induction problem using an evolutionary approach, and they do so without really considering the class imbalance issue. To cope with this limitation, we propose in this paper a multi-objective evolutionary approach to find optimized ODTs for imbalanced binary classification. Our approach, called ODT-Θ-NSGA-III (ODT-based-Θ-Nondominated Sorting Genetic Algorithm-III), is motivated by its abilities: (a) to escape local optima in the ODT search space and (b) to maximize both Precision and Recall simultaneously. Thanks to these two features, ODT-Θ-NSGA-III achieves results that are competitive with, or better than, those of many state-of-the-art classification algorithms on commonly used imbalanced benchmark data sets.
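To make the two objectives concrete, the following minimal Python sketch scores a few candidate oblique splits by Precision and Recall on a toy imbalanced data set and keeps the non-dominated ones. It is an illustration only, not the ODT-Θ-NSGA-III algorithm: it uses a single oblique split instead of a full tree, plain Pareto dominance instead of θ-dominance, and no evolutionary search. All names (`oblique_predict`, `precision_recall`, `dominates`, the candidate labels) and the toy data are assumptions introduced for this example.

```python
from typing import Dict, List, Sequence, Tuple


def oblique_predict(weights: Sequence[float], threshold: float,
                    x: Sequence[float]) -> int:
    """One oblique test: predict minority (1) if w . x > threshold, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > threshold else 0


def precision_recall(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and Recall for the minority class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance for maximization: a is no worse everywhere, better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


# Toy imbalanced data (assumed): four majority (0) and two minority (1) points.
X = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (3.0, 3.0), (4.0, 2.0), (0.5, 0.5)]
y = [0, 0, 0, 1, 1, 0]

# Hypothetical candidate splits, each given as (weights, threshold).
candidates = {
    "B": ((1.0, 1.0), 3.5),  # oblique test: x1 + x2 > 3.5
    "C": ((1.0, 0.0), 3.5),  # axis-parallel special case: x1 > 3.5
    "D": ((1.0, 1.0), 2.5),  # oblique test: x1 + x2 > 2.5
}

scores: Dict[str, Tuple[float, float]] = {
    name: precision_recall(y, [oblique_predict(w, t, x) for x in X])
    for name, (w, t) in candidates.items()
}

# Keep the candidates that no other candidate dominates.
front: List[str] = [
    name for name, s in scores.items()
    if not any(dominates(other, s) for other in scores.values())
]

for name, (p, r) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print("non-dominated:", front)
```

In this toy case, split D is dominated by B (equal Recall, lower Precision), while B and C trade Precision against Recall and neither dominates the other. A multi-objective search returns such a set of trade-off solutions rather than a single best classifier.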
