Named Entities in Medical Case Reports: Corpus and Experiments

We present a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library. In the case reports, we annotate cases, conditions, findings, factors and negation modifiers. Moreover, where applicable, we annotate relations between these entities. As such, this is the first corpus of this kind made available to the scientific community in English. It enables the initial investigation of automatic information extraction from case reports through tasks like Named Entity Recognition, Relation Extraction and (sentence/paragraph) relevance detection. Additionally, we present four strong baseline systems for the detection of medical entities made available through the annotated dataset.

[1]  Thomas Demeester,et al.  Adversarial training for multi-context joint entity and relation extraction , 2018, EMNLP.

[2]  Elena Beisswanger,et al.  A Proposal for a Configurable Silver Standard , 2010, Linguistic Annotation Workshop.

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Juliane Fluck,et al.  Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports , 2012, J. Biomed. Informatics.

[5]  Sophia Ananiadou,et al.  Annotating and detecting phenotypic information for chronic obstructive pulmonary disease , 2019, JAMIA open.

[6]  Daniel King,et al.  ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing , 2019, BioNLP@ACL.

[7]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2019 , 2018, Nucleic Acids Res..

[8]  Jurica Seva,et al.  QURATOR: Innovative Technologies for Content and Data Curation , 2020, Qurator.

[9]  Sebastian Ruder,et al.  An Overview of Multi-Task Learning in Deep Neural Networks , 2017, ArXiv.

[10]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[11]  Gilles Louppe,et al.  Independent consultant , 2013 .

[12]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[13]  R'emi Louf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[14]  Neil R. Smalheiser,et al.  A manual corpus of annotated main findings of clinical case reports , 2019, Database.

[15]  Chris Develder,et al.  Joint entity recognition and relation extraction as a multi-head selection problem , 2018, Expert Syst. Appl..

[16]  Iryna Gurevych,et al.  A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures , 2016, LT4DH@COLING.

[17]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Yifan Peng,et al.  BioSentVec: creating sentence embeddings for biomedical texts , 2018, 2019 IEEE International Conference on Healthcare Informatics (ICHI).

[19]  Wei Xu,et al.  Bidirectional LSTM-CRF Models for Sequence Tagging , 2015, ArXiv.

[20]  Vincent Claveau,et al.  Clinical Case Reports for NLP , 2019, BioNLP@ACL.

[21]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[22]  R. Rison,et al.  The CARE (CAse REport) guidelines and the standardization of case reports , 2013, Journal of Medical Case Reports.

[23]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[24]  Zhiyong Lu,et al.  BioCreative V CDR task corpus: a resource for chemical disease relation extraction , 2016, Database J. Biol. Databases Curation.

[25]  Qiang Yang,et al.  An Overview of Multi-task Learning , 2018 .