Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism

Genetic programming (GP) has been successfully applied to classification. However, GP may evolve biased classifiers when the classes are imbalanced, and such classifiers are often too unreliable for real-world applications. High dimensionality makes it even harder for a classifier to separate the majority class from the minority class. The use of GP to handle the joint effect of high dimensionality and class imbalance has received little attention. In this paper, we propose a GP approach to high-dimensional imbalanced classification, with two goals: improving classification performance and reducing training time. To achieve these goals, we develop a new fitness function that addresses class imbalance, and we propose a strategy that reuses previously evolved good GP individuals to improve efficiency. The proposed method is evaluated on ten high-dimensional imbalanced datasets. Experimental results show that, for high-dimensional imbalanced classification, the proposed method generally outperforms other GP methods as well as traditional classification algorithms that rely on sampling methods to address class imbalance.
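To illustrate why class imbalance biases an evolved classifier, the following minimal sketch compares plain accuracy with the average of the per-class accuracies when each is used as a fitness measure. It is not the fitness function proposed in this paper, which the abstract does not specify; the function names, the label encoding (1 for minority, 0 for majority), and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of instances classified correctly, regardless of class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                      minority: int = 1, majority: int = 0) -> float:
    """Average of the minority-class and majority-class accuracies.

    Hypothetical illustration only: each class contributes equally,
    no matter how many instances it has.
    """
    n_min = sum(t == minority for t in y_true)
    n_maj = sum(t == majority for t in y_true)
    tp = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    tn = sum(t == majority and p == majority for t, p in zip(y_true, y_pred))
    return 0.5 * (tp / n_min + tn / n_maj)


# Toy data (assumed): 5 minority instances and 95 majority instances.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100

print(accuracy(y_true, always_majority))           # 0.95
print(balanced_accuracy(y_true, always_majority))  # 0.5
```

Under plain accuracy, the degenerate classifier scores 0.95 even though it never detects the minority class, so an evolutionary search guided by accuracy can converge on it. A measure that weights both classes equally scores the same classifier at 0.5, which removes that incentive.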
