Accounting for misclassification in electronic health records-derived exposures using generalized linear finite mixture models

Exposures derived from electronic health records (EHR) may be misclassified, leading to biased estimates of their association with outcomes of interest. An example of this problem arises in the context of cancer screening where test indication, the purpose for which a test was performed, is often unavailable. This poses a challenge to understanding the effectiveness of screening tests because estimates of screening test effectiveness are biased if some diagnostic tests are misclassified as screening. Prediction models have been developed for a variety of exposure variables that can be derived from EHR, but no previous research has investigated appropriate methods for obtaining unbiased association estimates using these predicted probabilities. The full likelihood incorporating information on both the predicted probability of exposure-class membership and the association between the exposure and outcome of interest can be expressed using a finite mixture model. When the regression model of interest is a generalized linear model (GLM), the expectation–maximization algorithm can be used to estimate the parameters using standard software for GLMs. Using simulation studies, we compared the bias and efficiency of this mixture model approach to alternative approaches including multiple imputation and dichotomization of the predicted probabilities to create a proxy for the missing predictor. The mixture model was the only approach that was unbiased across all scenarios investigated. Finally, we explored the performance of these alternatives in a study of colorectal cancer screening with colonoscopy. These findings have broad applicability in studies using EHR data where gold-standard exposures are unavailable and prediction models have been developed for estimating proxies.
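The EM idea described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a binary outcome modeled by logistic regression, a single latent binary exposure, no additional covariates, and calibrated predicted exposure probabilities `p` that are treated as fixed. The function names (`em_mixture_logistic`, `weighted_logistic_fit`) and the simulated data are hypothetical. A small Newton-Raphson routine stands in for standard GLM software. The M-step is a weighted GLM fit on an augmented data set in which each subject appears once as exposed and once as unexposed.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def weighted_logistic_fit(X, y, w, beta, n_newton=25):
    """Weighted logistic regression by Newton-Raphson.

    Stand-in for a weighted fit in standard GLM software.
    """
    for _ in range(n_newton):
        mu = expit(X @ beta)
        grad = X.T @ (w * (y - mu))
        hess = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta


def em_mixture_logistic(y, p, n_em=500, tol=1e-10):
    """EM for logit P(Y=1 | X) = b0 + b1*X with latent binary X.

    p[i] is the predicted probability that subject i is exposed (X=1).
    Returns (beta, w), where w[i] is the posterior probability of exposure.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = y.size
    # Augmented data: rows 0..n-1 treat each subject as exposed,
    # rows n..2n-1 treat the same subject as unexposed.
    X_aug = np.column_stack([np.ones(2 * n), np.r_[np.ones(n), np.zeros(n)]])
    y_aug = np.r_[y, y]
    beta = np.zeros(2)
    w = p.copy()  # initial weights use the predicted probabilities only
    for _ in range(n_em):
        # M-step: weighted GLM fit with the current class-membership weights.
        new_beta = weighted_logistic_fit(X_aug, y_aug, np.r_[w, 1.0 - w], beta)
        # E-step: update the posterior probability of exposure given the outcome.
        mu1 = expit(new_beta[0] + new_beta[1])
        mu0 = expit(new_beta[0])
        f1 = mu1 ** y * (1.0 - mu1) ** (1.0 - y)
        f0 = mu0 ** y * (1.0 - mu0) ** (1.0 - y)
        w = p * f1 / (p * f1 + (1.0 - p) * f0)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, w


if __name__ == "__main__":
    # Hypothetical simulated data: calibrated probabilities, true log odds ratio 1.
    rng = np.random.default_rng(0)
    n = 5000
    p = rng.uniform(0.05, 0.95, size=n)
    x = rng.binomial(1, p)
    y = rng.binomial(1, expit(-1.0 + 1.0 * x))
    beta_hat, _ = em_mixture_logistic(y, p)
    print("estimated (b0, b1):", beta_hat)
```

When every predicted probability is exactly 0 or 1, the posterior weights equal `p` and the procedure reduces to an ordinary logistic regression on the known exposure. With uncertain probabilities, each subject contributes to both exposure classes in proportion to its weight, in contrast to dichotomizing `p` at a threshold.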
