A Genetic Programming-Based Imputation Method for Classification with Missing Data

Many industrial and real-world datasets suffer from the unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification, because inadequate treatment of missing values may lead to large classification errors. The problem of missing data has been addressed extensively in the statistics literature and, to a lesser extent, in the classification literature. One of the most popular approaches to dealing with missing data is to use imputation methods to fill missing values with plausible values. Some powerful imputation methods, such as the regression-based imputations in MICE [36], are often suitable for batch imputation tasks. However, they are often expensive when used to impute missing values for every single incomplete instance in the unseen set for classification. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. Experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results show that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. Moreover, using GPI to impute missing values for every single incomplete instance is dramatically faster than using the other imputation methods.
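The workflow the abstract describes, in which a regression model is built once from training data and then applied cheaply to each incomplete unseen instance, can be illustrated with a minimal sketch. This is not the paper's GPI implementation: an ordinary least-squares line stands in for the expression that genetic programming would evolve, and the function names (`fit_line`, `train_imputer`, `impute`) and the toy data are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit y ~ a*x + b.

    Assumption: this simple model is a stand-in for a GP-evolved
    regression expression; it is not the method used in the paper.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b


def train_imputer(rows, target, predictor):
    """Fit a model for feature `target` from feature `predictor`,
    using only the training rows where both values are present."""
    complete = [r for r in rows
                if r[target] is not None and r[predictor] is not None]
    return fit_line([r[predictor] for r in complete],
                    [r[target] for r in complete])


def impute(row, target, predictor, model):
    """Return a copy of one instance with a missing `target` filled in.

    Only a single model evaluation is needed per incomplete instance.
    """
    row = list(row)
    if row[target] is None:
        row[target] = model(row[predictor])
    return row


# Hypothetical toy data: feature 1 is roughly twice feature 0.
train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
model = train_imputer(train, target=1, predictor=0)

# Impute one unseen incomplete instance without refitting the model.
filled = impute([5.0, None], target=1, predictor=0, model=model)
print(filled)
```

The point of the sketch is the split between the one-off training step and the per-instance imputation step: once the model exists, filling a missing value in a new instance costs one function evaluation, which is the property the abstract credits for GPI's speed on unseen data.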

[1]  Michael O'Neill,et al.  Semantically-based Crossover in Genetic Programming: Application to Real-valued Symbolic Regression , 2022, Genetic Programming and Evolvable Machines.

[2]  Joseph L Schafer,et al.  Analysis of Incomplete Multivariate Data , 1997 .

[3]  John R. Koza,et al.  Genetic programming - on the programming of computers by means of natural selection , 1993, Complex adaptive systems.

[4]  Helio J. C. Barbosa,et al.  Symbolic regression via genetic programming , 2000, Proceedings. Vol.1. Sixth Brazilian Symposium on Neural Networks.

[5]  Dick den Hertog,et al.  Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming , 2009, IEEE Transactions on Evolutionary Computation.

[6]  Gm Gero Walter,et al.  Bayesian linear regression , 2009 .

[7]  J. Graham,et al.  Missing data analysis: making it work in the real world. , 2009, Annual review of psychology.

[8]  Roderick J. A. Little,et al.  Statistical Analysis with Missing Data , 2002 .

[9]  Maarten Keijzer,et al.  Improving Symbolic Regression with Interval Arithmetic and Linear Scaling , 2003, EuroGP.

[10]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[11]  Athanasios Tsakonas,et al.  Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation , 2011 .

[12]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[13]  Kurt Hornik,et al.  Multilayer feedforward networks are universal approximators , 1989, Neural Networks.

[14]  A. Topchy,et al.  Faster genetic programming based on local gradient search of numeric leaf values , 2001 .

[15]  Stef van Buuren,et al.  MICE: Multivariate Imputation by Chained Equations in R , 2011 .

[16]  Alexandros Agapitos,et al.  Controlling Overfitting in Symbolic Regression Based on a Bias/Variance Error Decomposition , 2012, PPSN.

[17]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[18]  Xiao-Li Meng,et al.  Applications of multiple imputation in medical studies: from AIDS to NHANES , 1999, Statistical methods in medical research.

[19]  Lukasz A. Kurgan,et al.  Impact of imputation of missing values on classification error for discrete data , 2008, Pattern Recognit.

[20]  Leonardo Vanneschi,et al.  Operator equalisation for bloat free genetic programming and a survey of bloat control methods , 2011, Genetic Programming and Evolvable Machines.

[21]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[22]  Witold Pedrycz,et al.  A Novel Framework for Imputation of Missing Values in Databases , 2007, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[23]  D. Kleinbaum,et al.  Applied Regression Analysis and Other Multivariate Methods , 1978 .

[24]  S. van Buuren,et al.  Flexible multivariate imputation by MICE , 1999 .

[25]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SIGKDD Explorations.

[26]  Aníbal R. Figueiras-Vidal,et al.  Pattern classification with missing data: a review , 2010, Neural Computing and Applications.

[27]  D. Rubin,et al.  Statistical Analysis with Missing Data. , 1989 .

[28]  Nong Ye,et al.  Naïve Bayes Classifier , 2013 .

[29]  Roderick J A Little,et al.  A Review of Hot Deck Imputation for Survey Non‐response , 2010, International statistical review = Revue internationale de statistique.

[30]  Patrick Royston,et al.  Multiple imputation using chained equations: Issues and guidance for practice , 2011, Statistics in medicine.

[31]  J. Berger Statistical Decision Theory and Bayesian Analysis , 1988 .

[32]  Mengjie Zhang,et al.  Multiple Imputation for Missing Data Using Genetic Programming , 2015, GECCO.

[33]  N. Draper,et al.  Applied Regression Analysis , 1966 .