Developing New Fitness Functions in Genetic Programming for Classification With Unbalanced Data

Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. A data set is unbalanced when at least one class (the minority class) is represented by only a small number of training examples while the other classes make up the majority. In this scenario, classifiers can achieve good accuracy on the majority class but very poor accuracy on the minority class(es), because the larger majority class dominates the traditional training criteria used in the fitness function. This paper aims both to highlight the limitations of current GP approaches in this area and to develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we show empirically that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).
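The bias described above can be illustrated with a small sketch. It is not one of the fitness functions developed in the paper (the abstract does not define them). It only contrasts overall accuracy with the average of the two per-class accuracies, a commonly used alternative, on a hypothetical data set of 5 minority and 95 majority examples. All function names and the data are assumptions made for illustration.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of all examples classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def class_accuracy(y_true, y_pred, label):
    """Fraction of the examples of one class classified correctly."""
    indices = [i for i, t in enumerate(y_true) if t == label]
    hits = sum(1 for i in indices if y_pred[i] == label)
    return hits / len(indices)


def average_class_accuracy(y_true, y_pred):
    """Mean of minority (label 1) and majority (label 0) accuracy.

    Each class contributes equally, regardless of its size.
    """
    minority = class_accuracy(y_true, y_pred, 1)
    majority = class_accuracy(y_true, y_pred, 0)
    return 0.5 * (minority + majority)


# Hypothetical unbalanced data: 5 minority (1) and 95 majority (0) examples.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = overall_accuracy(y_true, y_pred)
minority_acc = class_accuracy(y_true, y_pred, 1)
majority_acc = class_accuracy(y_true, y_pred, 0)
avg_acc = average_class_accuracy(y_true, y_pred)

print(acc, minority_acc, majority_acc, avg_acc)
```

Under these assumptions, the always-majority classifier looks strong on overall accuracy while failing on every minority example. A fitness measure that weights the two classes equally exposes that failure without resampling the training data.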
