Building a Coreference-Annotated Corpus from the Domain of Biochemistry

Coreference resolution remains a challenging information extraction task, especially in the biomedical domain, in part because annotated corpora for training are scarce. To address this gap, we developed the HANAPIN corpus, which consists of 20 full-text, open-access articles from the biochemistry literature. The corpus covers entities of several semantic types: chemical compounds, drug targets (e.g., proteins, enzymes, cell lines, pathogens), diseases, organisms and drug effects. All co-referring expressions pertaining to these semantic types were annotated according to an annotation scheme that we developed. We observed four general types of coreference in the corpus: sortal, pronominal, abbreviation and numerical. Inter-annotator agreement, computed as Krippendorff's alpha with the MASI distance metric, was 84%. The corpus is intended as a resource for other researchers developing their own coreference resolution methodologies.
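
As a rough illustration of the agreement measure named above, the following Python sketch computes Krippendorff's alpha over set-valued labels with the MASI distance. It is not the tooling used to build the corpus. The function names, the toy data, and the encoding of each annotation as a set of mention identifiers per unit are all assumptions made for the example. The MASI weights (1, 2/3, 1/3, 0) follow the commonly cited definition of the measure.

```python
from itertools import permutations


def masi_distance(a, b):
    """MASI distance between two label sets: 1 - Jaccard * monotonicity weight."""
    a, b = frozenset(a), frozenset(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical (assumption)
    inter = a & b
    jaccard = len(inter) / len(a | b)
    if a == b:
        weight = 1.0        # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3      # one set contains the other
    elif inter:
        weight = 1 / 3      # overlap, but neither contains the other
    else:
        weight = 0.0        # disjoint sets
    return 1.0 - jaccard * weight


def krippendorff_alpha(units, distance=masi_distance):
    """Alpha = 1 - D_o / D_e, where `units` holds, per unit, one label per annotator."""
    # Units labelled by fewer than two annotators cannot be paired and are skipped.
    units = [list(u) for u in units if len(u) >= 2]
    pooled = [value for u in units for value in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one unit with two labels")
    # Observed disagreement: distances between labels given to the same unit.
    d_o = sum(
        sum(distance(x, y) for x, y in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: distances between all pairs of pooled labels.
    d_e = sum(distance(x, y) for x, y in permutations(pooled, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all labels are identical")
    return 1.0 - d_o / d_e


# Hypothetical toy data: each unit is a mention; each label is the set of
# mention ids that one annotator placed in the same coreference chain.
toy_units = [
    [{"m1", "m2"}, {"m1", "m2"}],        # full agreement
    [{"m3", "m4"}, {"m3"}],              # one chain contained in the other
    [{"m5", "m6"}, {"m6", "m7"}],        # partial overlap
]
toy_alpha = krippendorff_alpha(toy_units)
```

A set-aware distance such as MASI gives partial credit when two annotators' chains overlap without matching exactly, which a strict equality test would count as complete disagreement.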
