Genetic programming based feature construction for classification with incomplete data

Missing values are an unavoidable problem in many real-world datasets. Handling incomplete data is a crucial requirement for classification because inadequate treatment of missing values often causes large classification errors. Feature construction has been successfully applied to improve classification with complete data, but it has seldom been applied to incomplete data. Genetic programming-based multiple feature construction (GPMFC) is a recent, promising feature construction method that uses genetic programming to evolve multiple new features from the original features for classification tasks. GPMFC can improve the accuracy and reduce the complexity of many decision tree and rule-based classifiers; however, it cannot work directly with incomplete data. This paper proposes IGPMFC, an extension of GPMFC that handles incomplete data. IGPMFC uses genetic programming with interval functions to directly evolve multiple features for classification with incomplete data. Experimental results show that IGPMFC can not only substantially improve accuracy but also reduce the complexity of the learnt classifiers when facing incomplete data.
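The abstract does not spell out how interval functions let a constructed feature be evaluated on an incomplete instance, so the following is only a minimal sketch of one plausible reading. It assumes that a missing value is replaced by the interval spanning the feature's observed range, that a known value becomes a degenerate interval, and that the evolved feature is a small arithmetic expression. The names `Interval`, `to_interval`, and `constructed_feature`, the feature ranges, and the example expression are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with basic interval arithmetic."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The product range is bounded by the four endpoint products.
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


def to_interval(value, feature_min, feature_max):
    """Assumption: a missing value (None or NaN) becomes the feature's full
    range; a known value becomes the degenerate interval [value, value]."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return Interval(feature_min, feature_max)
    return Interval(value, value)


def constructed_feature(x0, x1, x2):
    """Hypothetical evolved expression tree: (x0 + x1) * x2."""
    return (x0 + x1) * x2


# Hypothetical (min, max) ranges for three original features.
ranges = [(0.0, 10.0), (-1.0, 1.0), (2.0, 4.0)]

# One instance whose second feature is missing.
instance = [3.0, None, 2.0]
intervals = [to_interval(v, lo, hi) for v, (lo, hi) in zip(instance, ranges)]
result = constructed_feature(*intervals)
print(result)  # Interval(lo=4.0, hi=8.0)
```

Under these assumptions, a complete instance yields a degenerate interval (a single number), while an incomplete instance yields a range that bounds every value the constructed feature could take. How such interval outputs are turned into inputs for a classifier is not described in the abstract and is left open here.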
