Developing Interval-Based Cost-Sensitive Classifiers by Genetic Programming for Binary High-Dimensional Unbalanced Classification [Research Frontier]

Cost-sensitive learning is a popular approach to addressing class imbalance for many classification algorithms in machine learning. However, most cost-sensitive algorithms depend on manually designed cost matrices. In many complex unbalanced problems, it is not easy for humans, even experts, to accurately specify the misclassification costs of different mistakes, because the domain knowledge about the actual situation is lacking. As a result, these cost-sensitive algorithms cannot be directly applied. This paper proposes a new genetic programming-based approach to developing cost-sensitive classifiers that are independent of manually designed cost matrices. The proposed method constructs classifiers and learns cost intervals automatically and simultaneously. To this end, a tree representation, terminal sets, and function sets are designed and developed. We examine the effectiveness of the proposed method on ten high-dimensional unbalanced datasets. The experimental results show that the proposed method often outperforms the compared methods for high-dimensional unbalanced classification. Furthermore, an analysis of the evolved trees shows that the constructed classifiers often need only a small number of features to achieve good classification performance.
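To illustrate why a cost interval matters, the following minimal Python sketch uses the standard expected-cost decision rule for binary classification with zero cost for correct predictions: predict the minority class when its estimated probability is at least C_FP / (C_FP + C_FN). It is not the method proposed in the paper; the function names, the example costs, and the choice to fix the false-positive cost while the false-negative cost ranges over an interval are all assumptions made for illustration.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected cost when correct
    predictions cost nothing (standard cost-sensitive decision rule)."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("misclassification costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def threshold_interval(cost_fp: float, cost_fn_low: float, cost_fn_high: float):
    """Map an interval of false-negative costs (hypothetical setup) to the
    corresponding interval of decision thresholds."""
    if cost_fn_low > cost_fn_high:
        raise ValueError("cost_fn_low must not exceed cost_fn_high")
    # A higher false-negative cost gives a lower threshold.
    return (cost_threshold(cost_fp, cost_fn_high),
            cost_threshold(cost_fp, cost_fn_low))


def predict(p_minority: float, threshold: float) -> int:
    """Return 1 (minority class) if the estimated probability reaches the threshold."""
    return 1 if p_minority >= threshold else 0


if __name__ == "__main__":
    # Assumed example: a false positive costs 1, a false negative costs
    # somewhere between 4 and 9.
    low, high = threshold_interval(1.0, 4.0, 9.0)
    print(low, high)
    # The same instance is labeled differently at the two ends of the interval.
    print(predict(0.15, low), predict(0.15, high))
```

With these assumed costs, an instance whose minority-class probability is 0.15 is labeled as minority at one end of the interval and as majority at the other, which is why the choice of costs within the interval affects the resulting classifier.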
