Chia, a large annotated corpus of clinical trial eligibility criteria

We present Chia, a novel, large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. The dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinct entities of 15 entity types and 25,017 relationships of 12 relationship types. Each criterion is represented as a directed acyclic graph, which can be transformed into Boolean logic to form a database query. Chia can serve as a shared benchmark for developing and testing future machine learning, rule-based, or hybrid methods for information extraction from free-text clinical trial eligibility criteria.

Measurement(s): Clinical Trial Eligibility Criteria • Analytical Procedure Accuracy
Technology Type(s): digital curation • computational modeling technique
Sample Characteristic - Organism: Homo sapiens
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12765602
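To illustrate how a criterion represented as a directed acyclic graph might be turned into Boolean logic, the following is a minimal sketch only. The `Node` class, the `to_boolean` function, the operator labels, and the example criterion are all hypothetical and are not taken from the Chia annotation model or its released files; the leaf strings stand in for whatever query fragments a real system would derive from annotated entities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node: a leaf condition or a logical operator."""
    label: str              # leaf text, or "AND" / "OR" / "NOT"
    children: tuple = ()    # child nodes; empty for leaves


def to_boolean(node: Node) -> str:
    """Recursively render a criterion graph as a Boolean expression string."""
    if not node.children:
        return node.label
    parts = [to_boolean(child) for child in node.children]
    if node.label == "NOT":
        if len(parts) != 1:
            raise ValueError("NOT expects exactly one child")
        return f"NOT ({parts[0]})"
    # AND / OR: join the rendered children and parenthesize the group
    return "(" + f" {node.label} ".join(parts) + ")"


# Illustrative criterion (invented for this sketch):
# type 2 diabetes, and either HbA1c above 7 or on insulin, and not pregnant.
diabetes = Node("condition = 'type 2 diabetes'")
hba1c = Node("hba1c > 7")
insulin = Node("drug = 'insulin'")
pregnant = Node("condition = 'pregnancy'")

root = Node("AND", (diabetes,
                    Node("OR", (hba1c, insulin)),
                    Node("NOT", (pregnant,))))

print(to_boolean(root))
# (condition = 'type 2 diabetes' AND (hba1c > 7 OR drug = 'insulin') AND NOT (condition = 'pregnancy'))
```

Because the nodes are immutable, the same leaf can be shared by several parents, which is what makes the structure a graph rather than a strict tree. The resulting string resembles a SQL WHERE clause, but mapping it onto an actual database schema is outside the scope of this sketch.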
