Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach

Background: Clinical trials are an important step in introducing new interventions into clinical practice by generating data on their safety and efficacy. Clinical trials need to ensure that participants are similar so that the findings can be attributed to the interventions studied and not to other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that must be shared by the participants. Unfortunately, eligibility criteria are often too complex to be translated directly into readily executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time consuming, which delays the recruitment process.

Objective: Track 1 of the 2018 National Natural Language Processing Clinical Challenge focused on the task of cohort selection for clinical trials, aiming to answer the following question: Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials? The task required the participating systems to analyze longitudinal patient records to determine whether the corresponding patients met the given eligibility criteria. We aimed to describe a system developed to address this task.

Methods: Our system consisted of 13 classifiers, one for each eligibility criterion. All classifiers used a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such a representation, a pattern-matching approach was used to extract context-sensitive features. The extracted features were embedded back into the text as lexically distinguishable tokens, which were consequently featured in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances was available to learn from. A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria.

Results: The system was evaluated using the microaveraged F measure. In total, 4 machine learning algorithms (support vector machine, logistic regression, naïve Bayesian classifier, and gradient tree boosting [GTB]) were evaluated on the training data using 10-fold cross-validation. Of these, GTB demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. The final evaluation was performed on previously unseen test data. On average, the F measure of 89.04% was comparable to 3 of the top-ranked performances in the shared task (91.11%, 90.28%, and 90.21%). With an F measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50%, and 70.81%) in identifying patients with advanced coronary artery disease.

Conclusions: The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when trained on a relatively small dataset.
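
The idea of embedding pattern-matched features back into the text before building a bag-of-words representation can be illustrated with a minimal sketch. This is not the system described above: the regular expression, the numeric thresholds, the token name FEATURE_HBA1C_IN_RANGE, and the helper functions embed_features and bag_of_words are all assumptions introduced for illustration. The sketch shows how a numeric value that a plain bag-of-words model would detach from its context (or discard) can survive as a single distinguishable token.

```python
import re
from collections import Counter

# Illustrative pattern (an assumption): a lab name followed closely by a number.
HBA1C_PATTERN = re.compile(
    r"\b(?:hba1c|a1c)\b[^\d\n]{0,15}(\d{1,2}(?:\.\d+)?)", re.IGNORECASE
)

# Hypothetical thresholds and token name, chosen only for this example.
HBA1C_LOW, HBA1C_HIGH = 6.5, 9.5
FEATURE_TOKEN = "FEATURE_HBA1C_IN_RANGE"


def embed_features(text: str) -> str:
    """Append one feature token for each in-range HbA1c value found in the text."""
    tokens = []
    for match in HBA1C_PATTERN.finditer(text):
        value = float(match.group(1))
        if HBA1C_LOW <= value <= HBA1C_HIGH:
            tokens.append(FEATURE_TOKEN)
    return text if not tokens else text + " " + " ".join(tokens)


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word-like tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z][a-z0-9_]*", text.lower()))


if __name__ == "__main__":
    record = "HbA1c was 7.2 today. Prior A1c: 10.1."
    print(bag_of_words(embed_features(record)))
```

In this sketch the bare numbers are dropped by the tokenizer, so without the embedded token the representation would carry no information about whether any HbA1c value fell within the range of interest. The resulting counts could then be passed to any of the supervised classifiers mentioned above.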
