Efficient estimation in a partially specified nonignorable propensity score model

Abstract: Consider the regression setting where the response variable is subject to missing data and the covariates are fully observed. A nonignorable propensity score model, i.e., one in which the probability that the response is observed, conditional on all variables, depends on the possibly missing response value itself, is assumed throughout the paper. In such problems, model misspecification and model identifiability are two critical issues. A fully parametric approach can produce results that are sensitive to the model assumptions, while a fully nonparametric approach may not be sufficient for model identification. A new flexible semiparametric propensity score model is proposed in which the relationship between the missingness indicator and the partially observed response is left completely unspecified and estimated nonparametrically, while the relationship between the missingness indicator and the fully observed covariates is modeled parametrically. The proposed estimator is constructed via a semiparametric treatment and is proved to be semiparametrically efficient. Comprehensive simulation studies are conducted to examine the finite-sample performance of the estimators. While the naive parametric method leads to heavily biased estimators and poor coverage, the proposed method produces estimators with negligible finite-sample bias and correct inference results. The proposed method is further illustrated via an electronic health records (EHR) data application concerning the albumin level in blood samples. The empirical analyses demonstrate that the proposed semiparametric propensity score model is more sensible than a purely parametric model. The proposed method could be very useful for uncovering the unknown and possibly nonlinear dependence of the propensity score on the albumin level, and is recommended for practical use.
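To make the setting concrete, the following is a minimal simulation sketch of nonignorable missingness. It is not the estimator proposed in the paper. The logistic link, the nonlinear term `1.5 * tanh(y - 1)`, the coefficient on `x`, and the function name `simulate` are all assumptions chosen for illustration. The sketch generates a response whose chance of being observed depends on its own value, then compares the complete-case mean with an inverse-propensity-weighted mean that uses the true propensity (an oracle that is unavailable in practice).

```python
import numpy as np


def expit(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))


def simulate(n=200_000, seed=0):
    """Illustrative nonignorable-missingness simulation (hypothetical setup)."""
    rng = np.random.default_rng(seed)

    x = rng.normal(size=n)                   # fully observed covariate
    y = 1.0 + 0.5 * x + rng.normal(size=n)   # response, subject to missingness

    # Assumed propensity: parametric (linear) in x, nonlinear in y.
    # Because it depends on y itself, the mechanism is nonignorable.
    pi = expit(0.3 * x + 1.5 * np.tanh(y - 1.0))
    r = rng.random(n) < pi                   # r is True when y is observed

    truth = y.mean()                         # full-data mean (unavailable in practice)
    complete_case = y[r].mean()              # naive mean over observed responses

    # Normalized inverse-propensity-weighted mean using the TRUE propensity.
    w = 1.0 / pi[r]
    ipw = np.sum(w * y[r]) / np.sum(w)

    return {
        "truth": truth,
        "complete_case": complete_case,
        "ipw": ipw,
        "observed_fraction": r.mean(),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this setup larger responses are more likely to be observed, so the complete-case mean overshoots the full-data mean, while weighting by the true propensity removes most of that bias. In real data the propensity is unknown and has to be estimated, which is where the modeling choices discussed in the abstract come in.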
