Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data

Abstract: Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude fitting a logistic regression on all subjects, CART is appealing in part because some implementations allow incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. However, based on theoretical considerations, we argue that the automatic handling of missing data by CART may not be appropriate. Using a series of simulation experiments, we examined the performance of three approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error, and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.
