Relational machine learning for electronic health record-driven phenotyping

OBJECTIVE Electronic health records (EHRs) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Developing computer-based EHR phenotyping algorithms requires time and medical insight from clinical experts, who most often can review only a small subset of patients, taken as representative of the full EHR, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can help address these issues as a viable approach to EHR-based phenotyping.
METHODS Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve (AUROC) were measured for each phenotypic model on independent, manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance.
RESULTS We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models were evaluated for each ML approach. ILP achieved better overall AUROC than PART (p=0.039), J48 (p=0.003), and JRIP (p=0.003).
DISCUSSION ILP has the potential to improve phenotyping by independently delivering rules for phenotype definitions that clinical experts can interpret, or intuitive phenotypes to assist experts.
CONCLUSION Relational learning using ILP offers a viable approach to EHR-driven phenotyping.
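
The two-sided sign test named in METHODS can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sign_test_two_sided` and the 8-to-1 split are assumptions, since the abstract does not report per-phenotype win counts. Each phenotype counts as a win or a loss for one approach over another (ties dropped), and the p-value is the two-sided binomial tail probability under a null win rate of 0.5.

```python
from math import comb


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Two-sided sign test p-value under H0: P(win) = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs, no evidence against H0
    k = min(wins, losses)
    # Probability of a result at least as extreme in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical split: one approach beats another on 8 of 9 phenotypes.
print(round(sign_test_two_sided(8, 1), 3))  # 0.039
```

With nine phenotypes, a hypothetical 8-to-1 split gives p ≈ 0.039, the same value reported for the comparison with PART, although the abstract does not state the underlying counts.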
