Improved generalized raking estimators to address dependent covariate and failure‐time outcome error

Biomedical studies that use electronic health record (EHR) data for inference are often subject to bias due to measurement error. The measurement error in EHR data is typically complex, consisting of errors of unknown functional form in both the covariates and the outcome, and these errors can be dependent. To address the resulting bias, generalized raking has recently been proposed as a robust method that yields consistent estimates without the need to model the error structure. We explain why these previously proposed raking estimators can be expected to be inefficient in failure-time outcome settings that involve misclassification of the event indicator. We propose raking estimators that use multiple imputation, of either the target variables or the auxiliary variables, to improve efficiency. We also consider outcome-dependent sampling designs and investigate their impact on the efficiency of the raking estimators, both with and without multiple imputation. We present an extensive numerical study that examines the performance of the proposed estimators across a range of measurement error settings. We then apply the proposed methods to our motivating setting, an analysis of HIV outcomes in an observational cohort using EHR data from the Vanderbilt Comprehensive Care Clinic.
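For readers unfamiliar with raking, the following is a minimal sketch of the weight-calibration step only, not the estimators proposed here. It assumes a simple random validation (phase-two) sample, a single auxiliary variable that is observed for the full cohort, and the exponential (raking) form of the calibrated weights. The function name `rake_weights` and the simulated data are hypothetical and serve only as illustration.

```python
import math
import random


def rake_weights(design_weights, aux, targets, tol=1e-8, max_iter=50):
    """Calibrate weights w_i = d_i * exp(lam . x_i) so that the weighted
    phase-two totals of x_i = (1, a_i) match the full-cohort totals.

    Illustrative sketch: two calibration constraints, solved by Newton's method.
    """
    lam = [0.0, 0.0]
    for _ in range(max_iter):
        w = [d * math.exp(lam[0] * x[0] + lam[1] * x[1])
             for d, x in zip(design_weights, aux)]
        # Residuals of the two calibration equations.
        g0 = sum(wi * x[0] for wi, x in zip(w, aux)) - targets[0]
        g1 = sum(wi * x[1] for wi, x in zip(w, aux)) - targets[1]
        if (abs(g0) <= tol * max(1.0, abs(targets[0]))
                and abs(g1) <= tol * max(1.0, abs(targets[1]))):
            return w
        # Jacobian of the residuals: sum_i w_i x_i x_i^T (symmetric 2x2).
        a = sum(wi * x[0] * x[0] for wi, x in zip(w, aux))
        b = sum(wi * x[0] * x[1] for wi, x in zip(w, aux))
        c = sum(wi * x[1] * x[1] for wi, x in zip(w, aux))
        det = a * c - b * b
        # Newton update: lam <- lam - J^{-1} g.
        lam[0] -= (c * g0 - b * g1) / det
        lam[1] -= (a * g1 - b * g0) / det
    raise RuntimeError("raking did not converge")


# Simulated cohort (assumption): an auxiliary variable known for all N subjects.
random.seed(0)
N, n = 1000, 200
aux_all = [random.gauss(0.0, 1.0) for _ in range(N)]

# Phase two: simple random sample, so each design weight is N / n.
sampled = random.sample(range(N), n)
aux_phase2 = [(1.0, aux_all[i]) for i in sampled]
design = [N / n] * n

# Full-cohort totals of (1, a_i) are the calibration targets.
targets = (float(N), sum(aux_all))
weights = rake_weights(design, aux_phase2, targets)
```

In an analysis, the calibrated `weights` would replace the design weights when fitting the outcome model to the validated subsample. The sketch does not address how the auxiliary variable is chosen, which is what drives the efficiency gain from raking.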
