Semantic Annotation of Clinical Text: The CLEF Corpus

A significant amount of important information in Electronic Health Records (EHRs) is often found only in the unstructured part of patient narratives, making it difficult to process and utilize for tasks such as evidence-based health care or clinical research. In this paper, we describe the work carried out in the CLEF project on the semantic annotation of a corpus to assist in the development and evaluation of an Information Extraction (IE) system, as part of a larger framework for the capture, integration and presentation of clinical information. The CLEF corpus consists of both structured records and free text documents from the Royal Marsden Hospital pertaining to deceased cancer patients. The free text documents are of three types: clinical narratives, radiology reports and histopathology reports. A subset of the corpus has been selected for semantic annotation, and two annotation schemes have been created and applied to it: (i) one for clinical entities and the relations between them, and (ii) one for time expressions and their temporal relations with the clinical entities in the text. The paper describes the make-up of the annotated corpus, the semantic annotation schemes used to annotate it, details of the annotation process and of inter-annotator agreement studies, and how the annotated corpus is being used for developing supervised machine learning models for IE tasks.
