A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients

OBJECTIVE Phenotyping patients using electronic health record (EHR) data conventionally requires labeled cases and controls. Assigning labels requires manual medical chart review and is therefore labor intensive. For some phenotypes, identifying gold-standard controls is prohibitively difficult. We developed an accurate EHR phenotyping approach that does not require labeled controls.

MATERIALS AND METHODS Our framework relies on a random subset of cases, which can be specified using an anchor variable that has excellent positive predictive value and a sensitivity that is independent of the predictors. We propose a maximum likelihood approach that efficiently leverages data from the specified cases and unlabeled patients to develop logistic regression phenotyping models, and we compare model performance with that of existing algorithms.

RESULTS Our method outperformed the existing algorithms on predictive accuracy in 3 settings: Monte Carlo simulation studies, an application identifying hypertension patients with hypokalemia requiring oral supplementation using a simulated anchor, and an application identifying primary aldosteronism patients using real-world cases and anchor variables. Our method additionally generated consistent estimates of 2 important parameters: phenotype prevalence and the proportion of true cases that are labeled.

DISCUSSION Once an anchor variable that is scalable and transferable to different practices has been identified, our approach should facilitate development of scalable, transferable, and practice-specific phenotyping models.

CONCLUSIONS Our proposed approach enables accurate semiautomated EHR phenotyping with minimal manual labeling and therefore should greatly facilitate EHR clinical decision support and research.
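The maximum likelihood idea described under MATERIALS AND METHODS can be illustrated with a minimal sketch. The abstract does not state the likelihood, so the form below is an assumption: the anchor is taken to have perfect positive predictive value and a constant sensitivity c, giving P(s = 1 | x) = c * sigmoid(x . beta), a common positive-unlabeled formulation. The function names, the simulated data, and the plain gradient ascent optimizer are all hypothetical and are not the authors' implementation.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def linear(beta, x):
    return sum(b * xj for b, xj in zip(beta, x))


def pu_log_likelihood(beta, c, X, s):
    """Mean log-likelihood of anchor labels s, assuming P(s=1 | x) = c * sigmoid(x . beta)."""
    total = 0.0
    for x, si in zip(X, s):
        q = c * sigmoid(linear(beta, x))
        total += math.log(q) if si == 1 else math.log(1.0 - q)
    return total / len(X)


def fit_pu_logistic(X, s, step=0.2, n_iter=1500):
    """Gradient ascent on (beta, gamma); c = sigmoid(gamma) stays inside (0, 1)."""
    n, d = len(X), len(X[0])
    beta, gamma = [0.0] * d, 0.0
    for _ in range(n_iter):
        c = sigmoid(gamma)
        g_beta, g_gamma = [0.0] * d, 0.0
        for x, si in zip(X, s):
            p = sigmoid(linear(beta, x))
            if si == 1:
                # d/d(eta) log(c * p) and d/d(gamma) log(c * p)
                w_eta, w_gamma = 1.0 - p, 1.0 - c
            else:
                # d/d(eta) log(1 - c * p) and d/d(gamma) log(1 - c * p)
                denom = 1.0 - c * p
                w_eta = -c * p * (1.0 - p) / denom
                w_gamma = -p * c * (1.0 - c) / denom
            for j in range(d):
                g_beta[j] += w_eta * x[j]
            g_gamma += w_gamma
        beta = [b + step * g / n for b, g in zip(beta, g_beta)]
        gamma += step * g_gamma / n
    return beta, sigmoid(gamma)


def estimated_prevalence(beta, X):
    """Average fitted case probability over all patients."""
    return sum(sigmoid(linear(beta, x)) for x in X) / len(X)


def simulate(n, beta, c, seed=0):
    """Hypothetical data: intercept plus one predictor; a case is labeled with probability c."""
    rng = random.Random(seed)
    X, s = [], []
    for _ in range(n):
        x = [1.0, rng.gauss(0.0, 1.0)]
        is_case = rng.random() < sigmoid(linear(beta, x))
        s.append(1 if is_case and rng.random() < c else 0)
        X.append(x)
    return X, s


# Illustrative run with made-up parameter values.
X, s = simulate(400, beta=[-1.0, 2.0], c=0.6)
beta_hat, c_hat = fit_pu_logistic(X, s)
prev_hat = estimated_prevalence(beta_hat, X)
```

In this sketch, `c_hat` plays the role of the proportion of true cases that are labeled and `prev_hat` the phenotype prevalence. Under this formulation c is separated from the intercept only through the logistic functional form, so estimates from small samples or a slow first-order optimizer may be unstable; a production fit would likely use a second-order optimizer and convergence checks.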
