Annotation of a Large Clinical Entity Corpus

An entity-annotated corpus of the clinical domain is a basic requirement for detecting clinical entities with machine learning (ML) approaches. Prior research has shown statistical/ML approaches to be superior to rule-based approaches, but taking full advantage of them requires an accurately annotated corpus. Although a few annotated corpora are available, they are either built on a small data set or cover a narrow domain (such as cancer patient records or lab reports); an annotated large data set representing the entire clinical domain has not yet been created. In this paper, we describe in detail the annotation guidelines, the annotation process, and our approach to creating a clinical entity recognition (CER) corpus of 5,160 clinical documents from forty different clinical specialities. The clinical entities span various types, such as diseases, procedures, medications, and medical devices. We have classified them into eleven categories for annotation. Our annotation also reflects the relations among groups of entities that together constitute larger concepts.
