PIE: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data

Abstract

Objectives: This study proposes a novel Prior knowledge guided Integrated likelihood Estimation (PIE) method to correct bias in estimated associations caused by misclassification of electronic health record (EHR)-derived binary phenotypes, and evaluates the method by comparing it to 2 methods in common practice.

Methods: We conducted simulation studies and analyzed real EHR-derived data on diabetes from Kaiser Permanente Washington to compare the estimation bias of associations under 4 approaches: the proposed method, the method ignoring phenotyping errors, the maximum likelihood method with misspecified sensitivity and specificity, and the maximum likelihood method with correctly specified sensitivity and specificity (gold standard). The proposed method uses available information on phenotyping accuracy to construct a prior distribution for sensitivity and specificity, and incorporates this prior information through the integrated likelihood to reduce bias.

Results: In both the simulation studies and the real data application, the proposed method reduced estimation bias compared to the 2 current methods. It performed almost as well as the gold standard method when the prior density was highest around the true sensitivity and specificity. In the analysis of EHR data from Kaiser Permanente Washington, the estimated associations from PIE were very close to the estimates from the gold standard method, and bias was reduced by 60%–100% compared to the 2 methods commonly used in current practice for EHR data.

Conclusions: This study demonstrates that the proposed method can effectively reduce estimation bias caused by imperfect phenotyping in EHR-derived data by incorporating prior information through the integrated likelihood.
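The idea of integrating sensitivity and specificity out of the likelihood can be illustrated with a minimal sketch. The abstract does not give the model details, so everything below is an assumption made for illustration: a logistic model for the true phenotype, independent Beta priors on sensitivity and specificity, Monte Carlo averaging over prior draws, and the hypothetical function names `integrated_neg_loglik` and `fit_integrated_likelihood`. It is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logsumexp


def integrated_neg_loglik(coef, X, s, se_draws, sp_draws):
    """Negative log integrated likelihood, scaled by sample size.

    X: (n, k) design matrix; s: (n,) observed (error-prone) binary phenotype.
    se_draws, sp_draws: prior draws of sensitivity and specificity (assumed).
    """
    p = expit(X @ coef)  # P(true phenotype = 1 | x) under a logistic model
    # P(observed phenotype = 1 | x) for each prior draw, shape (draws, n)
    q = se_draws[:, None] * p + (1.0 - sp_draws[:, None]) * (1.0 - p)
    q = np.clip(q, 1e-12, 1.0 - 1e-12)
    loglik = (s * np.log(q) + (1 - s) * np.log1p(-q)).sum(axis=1)
    # Log of the Monte Carlo average of the likelihood over the prior draws
    log_integrated = logsumexp(loglik) - np.log(len(loglik))
    return -log_integrated / len(s)


def fit_integrated_likelihood(X, s, se_prior, sp_prior, n_draws=200, seed=0):
    """Maximize the integrated likelihood over the regression coefficients.

    se_prior, sp_prior: (a, b) parameters of assumed Beta priors.
    """
    rng = np.random.default_rng(seed)
    se_draws = rng.beta(se_prior[0], se_prior[1], size=n_draws)
    sp_draws = rng.beta(sp_prior[0], sp_prior[1], size=n_draws)
    result = minimize(
        integrated_neg_loglik,
        np.zeros(X.shape[1]),
        args=(X, s, se_draws, sp_draws),
        method="BFGS",
    )
    return result.x
```

Passing single draws of 1.0 for both sensitivity and specificity reduces the objective to ordinary logistic regression on the observed phenotype, which corresponds to the approach that ignores phenotyping errors. Passing a single fixed pair of values gives maximum likelihood with specified sensitivity and specificity.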
