A review of approaches to identifying patient phenotype cohorts using electronic health records

Objective: To summarize literature describing approaches aimed at automatically identifying patients with a common phenotype.

Materials and methods: We performed a review of studies describing systems or reporting techniques developed for identifying cohorts of patients with specific phenotypes. Every full-text article published in (1) Journal of the American Medical Informatics Association, (2) Journal of Biomedical Informatics, (3) Proceedings of the Annual American Medical Informatics Association Symposium, and (4) Proceedings of the Clinical Research Informatics Conference within the past 3 years was assessed for inclusion in the review. Only articles using automated techniques were included.

Results: Ninety-seven articles met our inclusion criteria. Forty-six used natural language processing (NLP)-based techniques, 24 described rule-based systems, 41 used statistical analyses, data mining, or machine learning techniques, and 22 described hybrid systems. Nine articles described the architecture of large-scale systems developed for determining cohort eligibility of patients.

Discussion: The number of studies on cohort identification using electronic medical records is rising. Statistical analyses and machine learning, followed by NLP techniques, have gained popularity over the years relative to rule-based systems.

Conclusions: There are a variety of approaches for classifying patients into a particular phenotype. Different techniques and data sources are used, and good performance is reported on datasets at the respective institutions. However, no system makes comprehensive use of electronic medical records while addressing all of their known weaknesses.
