To Adjust or not to Adjust? Estimating the Average Treatment Effect in Randomized Experiments with Missing Covariates

Complete randomization allows for consistent estimation of the average treatment effect based on the difference in means of the outcomes, without strong modeling assumptions on the outcome-generating process. Appropriate use of pretreatment covariates can further improve estimation efficiency. However, missingness in covariates is common in experiments and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased. The complete-covariate analysis adjusts for all completely observed covariates and improves on the efficiency of the difference in means if at least one completely observed covariate is predictive of the outcome. What, then, is the additional gain from adjusting for covariates subject to missingness? A key insight is that the missingness indicators act as fully observed pretreatment covariates as long as missingness is not affected by the treatment, and can thus be used in covariate adjustment to bring additional estimation efficiency. This motivates adding the missingness indicators to the regression adjustment, yielding the missingness-indicator method, a well-known but not widely used strategy in the missing-data literature. We recommend it for its many advantages. First, it removes the dependence of the regression-adjusted estimators on the imputed values for the missing covariates. Second, it improves on the estimation efficiency of both the complete-covariate analysis and the regression analysis based on only the imputed covariates. Third, it does not require modeling the missingness mechanism and yields a consistent and efficient estimator even if the missing-data mechanism is related to the missing covariates and unobservable potential outcomes. Lastly, it is easy to implement via standard software packages for least squares. We also propose modifications to the missingness-indicator method based on asymptotic and finite-sample considerations.
To reconcile the conflicting recommendations in the missing-data literature, we compare various strategies for analyzing randomized experiments with missing covariates under the design-based framework. This framework treats randomization as the basis for inference and imposes no modeling assumptions on the outcome-generating process or the missing-data mechanism.
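
The missingness-indicator method described above can be illustrated with a minimal sketch. It assumes a fully interacted least-squares adjustment on centered covariates; the abstract says only "regression adjustment", so this specification is an assumption rather than the authors' stated estimator. The function name `missingness_indicator_ate`, the `fill` parameter, and the simulated data are all hypothetical and exist only for illustration.

```python
import numpy as np


def missingness_indicator_ate(y, z, x, fill=0.0):
    """Sketch of a regression-adjusted ATE estimate with missingness indicators.

    y: (n,) outcomes; z: (n,) 0/1 treatment; x: (n, p) covariates, np.nan = missing.
    fill: constant used to impute missing entries (hypothetical parameter).
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)

    miss = np.isnan(x)
    x_imp = np.where(miss, fill, x)                       # constant imputation
    indicators = miss[:, miss.any(axis=0)].astype(float)  # one indicator per incomplete column

    # Augmented covariates: imputed values plus missingness indicators, centered.
    w = np.column_stack([x_imp, indicators])
    w = w - w.mean(axis=0)

    # Assumed specification: intercept, treatment, covariates, and interactions.
    design = np.column_stack([np.ones_like(y), z, w, z[:, None] * w])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on the treatment indicator


# Simulated completely randomized experiment (illustrative data only).
rng = np.random.default_rng(0)
n, tau = 2000, 1.5
x_full = rng.normal(size=(n, 2))
y0 = x_full @ np.array([1.0, 2.0]) + rng.normal(size=n)
z = rng.permutation(np.repeat([0.0, 1.0], n // 2))
y = y0 + tau * z

# The second covariate is missing for roughly 30% of units, independently of z.
x_obs = x_full.copy()
x_obs[rng.random(n) < 0.3, 1] = np.nan

est_zero = missingness_indicator_ate(y, z, x_obs, fill=0.0)
est_five = missingness_indicator_ate(y, z, x_obs, fill=5.0)
print(est_zero, est_five)
```

Under this specification, changing `fill` only re-expresses the same column space, because each imputed column differs across fill values by a multiple of its own missingness indicator. The two estimates therefore agree up to floating-point error, which is one way to see the abstract's first advantage: the estimator does not depend on the imputed values.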
