Random forest missing data algorithms

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They handle mixed types of missing data, adapt to interactions and nonlinearity, and have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, we assessed the imputation performance of various RF algorithms under different missing data mechanisms. The algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
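
The abstract refers to different missing data mechanisms without defining them. As a rough illustration only, the sketch below assumes the usual three-way taxonomy (missing completely at random, missing at random, missing not at random) and shows one way such missingness could be simulated on a toy two-column data set. The function name `make_missing`, the median threshold, and the doubling of the rate are hypothetical choices for this example and are not taken from the study.

```python
import random
import statistics


def make_missing(rows, mechanism, rate=0.25, seed=0):
    """Return a copy of rows, a list of (x, y) pairs, with some y set to None.

    Hypothetical illustration of three missingness mechanisms:
      "MCAR": y is dropped with a fixed probability `rate`.
      "MAR":  y is dropped only when the observed x is above its median.
      "MNAR": y is dropped only when y itself is above its median.
    """
    rng = random.Random(seed)
    x_median = statistics.median(x for x, _ in rows)
    y_median = statistics.median(y for _, y in rows)
    high_p = min(1.0, 2 * rate)  # drop probability in the affected half

    result = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = high_p if x > x_median else 0.0
        elif mechanism == "MNAR":
            p = high_p if y > y_median else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        # x is always kept; only y can become missing
        result.append((x, None if rng.random() < p else y))
    return result


# Toy data: x increases while y decreases
rows = [(float(i), float(10 - i)) for i in range(10)]
mar_rows = make_missing(rows, "MAR", rate=0.5)
mnar_rows = make_missing(rows, "MNAR", rate=0.5)
```

Under MAR the missingness can in principle be explained by the observed column `x`, whereas under MNAR it depends on the very values that are lost, which is why the latter is usually considered the hardest setting for any imputation method.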