A Comparison of Classification Strategies in Genetic Programming with Unbalanced Data

Machine learning algorithms like Genetic Programming (GP) can evolve biased classifiers when data sets are unbalanced. In this paper we compare the effectiveness of two GP classification strategies. The first uses the standard (zero) class-threshold, while the second uses the “best” class-threshold determined dynamically on a solution-by-solution basis during evolution. These two strategies are evaluated using five different GP fitness across across a range of binary class imbalance problems, and the GP approaches are compared to other popular learning algorithms, namely, Naive Bayes and Support Vector Machines. Our results suggest that there is no overall difference between the two strategies, and that both strategies can evolve good solutions in binary classification when used in combination with an effective fitness function.

[1]  Stephan M. Winkler,et al.  Advanced Genetic Programming Based Machine Learning , 2007, J. Math. Model. Algorithms.

[2]  J. Holmes Differential Negative Reinforcement Improves Classifier System Learning Rate in Two-Class Problems with Unequal Base Rates , 1990 .

[3]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[4]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[5]  Malcolm I. Heywood,et al.  GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation , 2008, EuroGP.

[6]  Mengjie Zhang,et al.  Using Gaussian distribution to construct fitness functions in genetic programming for multiclass object classification , 2006, Pattern Recognit. Lett..

[7]  Michael C. Mozer,et al.  Optimizing Classifier Performance Via the Wilcoxon-Mann-Whitney Statistic , 2003, ICML 2003.

[8]  Hitoshi Iba,et al.  Genetic Programming 1998: Proceedings of the Third Annual Conference , 1999, IEEE Trans. Evol. Comput..

[9]  Michael C. Mozer,et al.  Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic , 2003, ICML.

[10]  John R. Koza,et al.  Genetic programming - on the programming of computers by means of natural selection , 1993, Complex adaptive systems.

[11]  Dariu Gavrila,et al.  An Experimental Study on Pedestrian Classification , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[13]  Mark Johnston,et al.  Genetic programming for image classification with unbalanced data , 2009, 2009 24th International Conference Image and Vision Computing New Zealand.