EliIE: An open-source information extraction system for clinical trial eligibility criteria

Objective: To develop an open-source information extraction system called Eligibility Criteria Information Extraction (EliIE) for parsing and formalizing free-text clinical research eligibility criteria (EC) following the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.0.

Materials and Methods: EliIE parses EC in 4 steps: (1) clinical entity and attribute recognition, (2) negation detection, (3) relation extraction, and (4) concept normalization and output structuring. Informaticians and domain experts were recruited to design an annotation guideline and generate a training corpus of annotated EC for 230 Alzheimer's clinical trials. The annotated EC were represented as queries against the OMOP CDM and comprised 8008 entities, 3550 attributes, and 3529 relations. A sequence labeling-based method was developed for automatic entity and attribute recognition. Negation detection was supported by NegEx and a set of predefined rules. Relation extraction was achieved by a support vector machine classifier. We further performed terminology-based concept normalization and output structuring.

Results: In task-specific evaluations, the best F1 scores were 0.79 for entity recognition and 0.89 for relation extraction. The accuracy of negation detection was 0.94. In an end-to-end evaluation, the overall accuracy for query formalization was 0.71.

Conclusions: This study presents EliIE, an OMOP CDM-based information extraction system for automatic structuring and formalization of free-text EC. According to our evaluation, the machine learning-based EliIE outperforms existing systems and shows promise for further improvement.
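To make the negation detection step more concrete, the following is a minimal Python sketch of a NegEx-style rule: a trigger phrase negates the tokens that follow it, up to a fixed window or a termination word. It is not the EliIE implementation. The trigger list, the termination terms, the window size, and the function names are all illustrative assumptions, and the real NegEx lexicon is considerably larger.

```python
import re

# Hypothetical, abbreviated trigger list (assumption); not the NegEx lexicon.
PRE_NEGATION_TRIGGERS = ["no history of", "absence of", "free of", "without", "no"]
# Hypothetical words that end a negation scope (assumption).
TERMINATION_TERMS = {"but", "however", "except"}
# Maximum number of tokens after a trigger treated as negated (assumption).
WINDOW = 5


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def negated_token_indices(tokens):
    """Return the indices of tokens that fall inside a negation scope."""
    # Try longer triggers first so "no history of" wins over "no".
    triggers = sorted((t.split() for t in PRE_NEGATION_TRIGGERS),
                      key=len, reverse=True)
    negated = set()
    i = 0
    while i < len(tokens):
        match = next((t for t in triggers if tokens[i:i + len(t)] == t), None)
        if match is None:
            i += 1
            continue
        start = i + len(match)
        for j in range(start, min(start + WINDOW, len(tokens))):
            if tokens[j] in TERMINATION_TERMS:
                break  # scope ends at a termination word
            negated.add(j)
        i = start
    return negated


def is_negated(criterion, entity):
    """True if the first mention of `entity` lies fully inside a negation scope."""
    tokens = tokenize(criterion)
    entity_tokens = tokenize(entity)
    if not entity_tokens:
        return False
    negated = negated_token_indices(tokens)
    for i in range(len(tokens) - len(entity_tokens) + 1):
        if tokens[i:i + len(entity_tokens)] == entity_tokens:
            return all(k in negated for k in range(i, i + len(entity_tokens)))
    return False
```

For a criterion such as "No history of stroke but diagnosed with Alzheimer's disease", this sketch marks "stroke" as negated and leaves "Alzheimer's disease" affirmed, because "but" closes the negation scope. A rule-based layer like this is cheap to inspect and extend, which is one plausible reason to pair it with additional predefined rules as the abstract describes.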
