ACCOUNTING FOR DEPENDENT ERRORS IN PREDICTORS AND TIME-TO-EVENT OUTCOMES USING ELECTRONIC HEALTH RECORDS, VALIDATION SAMPLES, AND MULTIPLE IMPUTATION.

Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially affect estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. In EHR data from 4217 patients, the estimated hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at antiretroviral therapy (ART) initiation was 0.74 (95% CI: 0.68-0.80) using unvalidated data and 0.60 (95% CI: 0.53-0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating only a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for the unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with that of estimates from the unvalidated dataset and from the validated subsample alone, for various subsample sizes. With reasonably sized validated subsamples and appropriate imputation models, our approach improved estimation over both the naive analysis of unvalidated data and the analysis using only the validation subsample.
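As an illustration of the final step of a multiple-imputation analysis, the sketch below pools log hazard ratio estimates from several imputed datasets using Rubin's rules. It is a minimal example under stated assumptions, not the authors' implementation: the abstract does not say which variance estimator was used, the function name `pool_rubin` is hypothetical, the input numbers are made up for illustration, and the normal quantile 1.96 is a simplification of the t reference distribution usually paired with Rubin's rules.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules (hypothetical helper).

    estimates: point estimates, one per imputed dataset (e.g. log hazard ratios)
    variances: the corresponding squared standard errors
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)            # average of the m point estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up log hazard ratios and variances from m = 5 imputed datasets.
log_hrs = [-0.50, -0.48, -0.53, -0.51, -0.49]
log_hr_vars = [0.0040, 0.0038, 0.0041, 0.0040, 0.0039]

pooled_est, total_var = pool_rubin(log_hrs, log_hr_vars)
se = math.sqrt(total_var)

# Back-transform to the hazard ratio scale; 1.96 is a normal approximation.
hr = math.exp(pooled_est)
ci = (math.exp(pooled_est - 1.96 * se), math.exp(pooled_est + 1.96 * se))
print(f"pooled HR = {hr:.2f}, approximate 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

In an analysis like the one described, each entry of `log_hrs` would come from refitting the outcome model to one completed dataset in which values for the unvalidated records were imputed from models fitted to the validated subsample.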
