Propensity score and proximity matching using random forest.

To draw unbiased inference from observational data, matching methods are often applied to produce treatment and control groups that are balanced on all background variables. The propensity score has been a key component in this research area. However, propensity score based matching methods in the literature have several limitations: they are vulnerable to model misspecification, and they have difficulty handling categorical variables with more than two levels, missing data, and nonlinear relationships. Random forest, which averages the outcomes of many decision trees, is nonparametric, straightforward to use, and capable of addressing these issues. More importantly, the precision afforded by random forest (Caruana et al., 2008) may provide a more accurate and less model-dependent estimate of the propensity score. In addition, the proximity matrix, a by-product of the random forest, may naturally serve as a distance measure between observations for use in matching. We apply the proposed random forest based matching methods to data from the National Health and Nutrition Examination Survey (NHANES). Our results show that the proposed methods can produce well-balanced treatment and control groups. We also illustrate that the methods can deal effectively with missing data in covariates.
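To make the proximity idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes the usual random forest definition of proximity (the fraction of trees in which two observations fall in the same terminal node) and assumes the terminal-node assignments are already available from a fitted forest. The function names `proximity_matrix` and `greedy_match`, the toy leaf assignments, and the greedy 1:1 matching rule are all illustrative assumptions; the abstract does not specify the matching algorithm.

```python
from itertools import combinations


def proximity_matrix(leaves):
    """Build a proximity matrix from terminal-node assignments.

    leaves[i][t] is the terminal-node id of observation i in tree t.
    Proximity of i and j = share of trees where both land in the same node.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prox[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def greedy_match(prox, treated, controls):
    """Greedy 1:1 matching without replacement (an illustrative choice).

    Each treated unit, in the order given, takes the unused control with
    the highest proximity; ties go to the lowest control index.
    """
    available = set(controls)
    pairs = {}
    for t in treated:
        if not available:
            break
        best = max(sorted(available), key=lambda c: prox[t][c])
        pairs[t] = best
        available.remove(best)
    return pairs


# Hypothetical leaf assignments: 5 observations, 4 trees.
leaves = [
    [1, 2, 1, 3],  # observation 0 (treated)
    [2, 2, 3, 1],  # observation 1 (treated)
    [1, 2, 1, 1],  # observation 2 (control)
    [2, 1, 3, 1],  # observation 3 (control)
    [3, 3, 2, 2],  # observation 4 (control)
]
prox = proximity_matrix(leaves)
pairs = greedy_match(prox, treated=[0, 1], controls=[2, 3, 4])
print(pairs)
```

In this toy example, `1 - prox[i][j]` plays the role of the distance between observations; a real analysis would obtain the leaf assignments from a forest fitted to predict treatment status and would check covariate balance after matching.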

[1] Xin Yan, et al. Facilitating score and causal inference trees for large observational studies, 2012, J. Mach. Learn. Res.

[2] R. D'Agostino. Adjustment Methods: Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non‐Randomized Control Group, 2005.

[3] Leo Breiman, et al. Classification and Regression Trees, 1984.

[4] Andy Liaw, et al. Classification and Regression by randomForest, 2007.

[5] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[6] Rich Caruana, et al. An empirical comparison of supervised learning algorithms, 2006, ICML.

[7] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[8] J. Hayes, et al. Using multiple imputation and propensity scores to test the effect of car seats and seat belt usage on injury severity from trauma registry data, 2008, Journal of pediatric surgery.

[9] Jeffrey S. Simonoff, et al. An Investigation of Missing Data Methods for Classification Trees, 2006, J. Mach. Learn. Res.

[10] Elizabeth A Stuart, et al. Improving propensity score weighting using machine learning, 2010, Statistics in medicine.

[11] Jerome P. Reiter, et al. A comparison of two methods of estimating propensity scores after multiple imputation, 2016, Statistical methods in medical research.

[12] Elizabeth A Stuart, et al. Matching methods for causal inference: A review and a look forward, 2010, Statistical science : a review journal of the Institute of Mathematical Statistics.

[13] F. Paccaud, et al. Consequences of smoking for body weight, body fat distribution, and insulin resistance, 2008, The American journal of clinical nutrition.

[14] Rich Caruana, et al. An empirical evaluation of supervised learning in high dimensions, 2008, ICML '08.

[15] Los Angeles, et al. Missing Data Imputation for Tree-Based Models, 2006.

[16] Torsten Hothorn, et al. Recursive partitioning on incomplete data using surrogate decisions and multiple imputation, 2012, Comput. Stat. Data Anal.

[17] Daniel Westreich, et al. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression, 2010, Journal of clinical epidemiology.

[18] D. Rubin, et al. Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score, 1985.

[19] D. Rubin, et al. Reducing Bias in Observational Studies Using Subclassification on the Propensity Score, 1984.

[20] Jennifer Hill, et al. Reducing Bias in Treatment Effect Estimation in Observational Studies Suffering from Missing Data, 2004.

[21] D. Schade, et al. Effect of smoking on hemoglobin A1c and body mass index in patients with type 2 diabetes mellitus, 2002, Journal of investigative medicine : the official publication of the American Federation for Clinical Research.

[22] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[23] S. Schneeweiss, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study, 2008, Pharmacoepidemiology and drug safety.

[24] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[25] D. Rubin, et al. The central role of the propensity score in observational studies for causal effects, 1983.

[26] Carolin Strobl, et al. Random Forests with Missing Values in the Covariates, 2010.

[27] P. Mahalanobis. On the generalized distance in statistics, 1936.