Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records

BACKGROUND AND OBJECTIVE Recent progress towards precision medicine has encouraged the use of electronic health records (EHRs) as a source of the large amounts of data required to study the effect of treatments or risk factors in more specific subpopulations. Phenotyping algorithms automatically classify patients according to their particular electronic phenotype, thus facilitating the setup of retrospective cohorts. Our objective is to compare the performance of different classification strategies (standardized problems only, rule-based algorithms, statistical learning algorithms (six learners) and stacked generalization (five versions)) for the categorization of patients according to their diabetic status (diabetic, non-diabetic or inconclusive; diabetes of any type) using information extracted from EHRs.

METHODS Patient information was extracted from the EHR at Hospital Italiano de Buenos Aires, Buenos Aires, Argentina. For the derivation and validation datasets, two probabilistic samples of patients from different years (2005: n = 1663; 2015: n = 800) were extracted. The only inclusion criterion was age (≥40 and <80 years). Four researchers manually reviewed all records and classified patients according to their diabetic status (diabetic: diabetes registered as a health problem or fulfilling the ADA criteria; non-diabetic: not fulfilling the ADA criteria and having at least one fasting glycemia below 126 mg/dL; inconclusive: no data regarding their diabetic status or only one abnormal value). The best-performing algorithms within each strategy were tested on the validation set.

RESULTS The standardized codes algorithm achieved a Kappa coefficient of 0.59 (95% CI 0.49, 0.59) in the validation set. The Boolean logic algorithm reached 0.82 (95% CI 0.76, 0.88). A slightly higher value was achieved by the feedforward neural network (0.9, 95% CI 0.85, 0.94). The best-performing learner was the stacked generalization meta-learner, which reached a Kappa coefficient of 0.95 (95% CI 0.91, 0.98).

CONCLUSIONS The stacked generalization strategy and the feedforward neural network showed the best classification metrics in the validation set. Implementing these algorithms makes it possible to accurately exploit the data of thousands of patients.
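The Kappa coefficient reported in the results measures how far the agreement between an algorithm's labels and the manual review exceeds what chance alone would produce. The sketch below illustrates the calculation for the three diabetic-status categories. It assumes the unweighted Cohen's kappa (the abstract does not state which variant was used), and the function name and the ten toy labels are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa between two equal-length label sequences.

    Assumption: this is the unweighted variant; the abstract does not say
    which kappa variant the authors computed.
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("sequences must be non-empty and of equal length")

    n = len(reference)
    # Observed agreement: fraction of patients given the same label by both.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n

    # Chance agreement: for each label, the product of its marginal frequencies.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    labels = set(ref_counts) | set(pred_counts)
    expected = sum(ref_counts[lab] * pred_counts[lab] for lab in labels) / (n * n)

    if expected == 1.0:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (not study data): manual review vs. algorithm output.
manual = ["diabetic"] * 4 + ["non-diabetic"] * 4 + ["inconclusive"] * 2
algorithm = (
    ["diabetic"] * 3
    + ["non-diabetic"] * 5
    + ["inconclusive", "diabetic"]
)

kappa = cohen_kappa(manual, algorithm)
print(round(kappa, 3))
```

In this toy case the two label sets agree on 8 of 10 patients (observed agreement 0.80), while the marginal frequencies give a chance agreement of 0.38, so kappa is (0.80 - 0.38) / (1 - 0.38), roughly 0.68.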
