Hybrid bag of approaches to characterize selection criteria for cohort identification

OBJECTIVE The 2018 National NLP Clinical Challenge (2018 n2c2) focused on the task of cohort selection for clinical trials: participating systems analyzed longitudinal patient records to determine whether each patient met or did not meet each of 13 selection criteria. This article describes our participation in this shared task.

MATERIALS AND METHODS We followed a hybrid approach combining pattern-based, knowledge-intensive, and feature weighting techniques. After preprocessing the notes with publicly available natural language processing tools, we developed individual criterion-specific components. Each component relied on knowledge resources collected for its criterion and on pattern-based and weighting approaches to identify "met" and "not met" cases.

RESULTS We submitted 3 runs to the 2018 n2c2 challenge. The overall micro-averaged F1 on the training set was 0.9444. On the test set, the micro-averaged F1 scores for the 3 submitted runs were 0.9075, 0.9065, and 0.9056. The best run placed second in the overall challenge, and all 3 runs were statistically similar to the top-ranked system. A reimplemented system achieved the best overall F1 of 0.9111 on the test set.

DISCUSSION We highlight the need for a focused, resource-intensive effort to address the class imbalance in the cohort selection task.

CONCLUSION Our hybrid approach identified all selection criteria with high F1 performance on both the training and test sets. Based on our participation in the 2018 n2c2 task, we conclude that there is merit in continuing focused criterion-specific analysis and in developing appropriate knowledge resources to build a quality cohort selection system.
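For readers unfamiliar with the metric reported above, the following is a minimal sketch of a pooled micro-averaged F1 over several criteria. It is an illustration under stated assumptions, not the official n2c2 scorer, whose exact averaging may differ. The function name `micro_f1`, the criterion names, and the toy labels are all hypothetical.

```python
from typing import Dict, List


def micro_f1(
    gold: Dict[str, List[str]],
    pred: Dict[str, List[str]],
    positive: str = "met",
) -> float:
    """Pool true/false positives and false negatives across all criteria,
    then compute a single F1 for the `positive` label.

    Assumption: this is the textbook pooled micro-average; the challenge's
    own evaluation script may combine classes differently.
    """
    tp = fp = fn = 0
    for criterion, gold_labels in gold.items():
        for g, p in zip(gold_labels, pred[criterion]):
            if p == positive and g == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif g == positive:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: two hypothetical criteria, three patients each.
    gold = {
        "criterion_a": ["met", "not met", "met"],
        "criterion_b": ["not met", "met", "not met"],
    }
    pred = {
        "criterion_a": ["met", "met", "met"],
        "criterion_b": ["not met", "not met", "not met"],
    }
    print(round(micro_f1(gold, pred), 4))
```

Because counts are pooled before the F1 is computed, criteria with many positive cases dominate the score, which is one reason class imbalance matters for this task.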
