Evolving regression trees robust to missing data

Data quality is a major concern in several fields of knowledge that rely on data analysis. Missing data, in particular, have a strong negative impact on machine learning, potentially harming the knowledge extraction process by skewing results and degrading the predictive performance of the induced models. To deal with missing data, the machine learning literature offers a variety of strategies, which take the form of either a preprocessing step or a solution embedded within a predictive method. In this paper, we propose a novel evolutionary algorithm for regression tree induction that embeds in its evolutionary cycle a robust framework for dealing with missing data. For comparison purposes, we evaluate six traditional regression algorithms over 10 public regression datasets that were artificially modified to present different levels of missing data. Results from the experimental analysis show that the proposed approach is the least affected by increasing levels of missing data, presenting an interesting trade-off between model interpretability and predictive performance, especially for datasets with more than 40% missing data.
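To make the experimental setup concrete, the following is a minimal sketch of how a dataset could be artificially modified to contain a chosen level of missing data. It assumes values are removed completely at random over the feature cells, which the abstract does not state; the function name `inject_missing`, its parameters, and the use of `None` as the missing-value marker are hypothetical and not taken from the paper.

```python
import random


def inject_missing(rows, rate, seed=0):
    """Return a copy of `rows` with a fraction `rate` of cells set to None.

    Assumption: cells are removed completely at random (MCAR). The paper's
    actual removal procedure is not described in the abstract.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    n_cols = len(rows[0]) if rows else 0
    # Enumerate every (row, column) position, then pick which ones to blank.
    cells = [(i, j) for i in range(len(rows)) for j in range(n_cols)]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    return [
        [None if (i, j) in masked else value for j, value in enumerate(row)]
        for i, row in enumerate(rows)
    ]


if __name__ == "__main__":
    # Toy feature matrix: 10 instances, 4 attributes.
    data = [[float(i + j) for j in range(4)] for i in range(10)]
    for rate in (0.1, 0.4, 0.7):
        corrupted = inject_missing(data, rate)
        n_missing = sum(v is None for row in corrupted for v in row)
        print(f"rate={rate:.1f} -> {n_missing} of 40 cells missing")
```

Fixing the seed keeps each missing-data level reproducible, so every algorithm under comparison would see the same corrupted copy of a dataset.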
