Tools to Address the Interdependence between Tokenisation and Standoff Annotation

In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use standoff annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities because it provides appropriate units for stand-off annotation. Furthermore, these units can be easily selected, swept out or snapped to by the annotators and certain classes of annotation mistakes can be prevented by building a tool that does not permit selection of a substring which does not entirely span one or more XML elements. There is a downside to this method of annotation, however, in that it assumes that for any given data set, in whatever domain, the optimal tokenisation is known before any annotation is performed. If mistakes are made in the initial tokenisation and the word boundaries conflict with the annotators' desired actions, then either the annotation is inaccurate or expensive retokenisation and reannotation will be required. Here we describe the methods we have developed to address this problem. We also describe experiments which explore the effects of different granularities of tokenisation on NER tagger performance.

[1]  Marc Moens,et al.  LT TTT - A Flexible Tokenisation Tool , 2000, LREC.

[2]  Beatrice Alex,et al.  Investigating the Effects of Selective Sampling on the Annotation Task , 2005 .

[3]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[4]  Marc Moens,et al.  Seventh Message Understanding Conference (MUC-7) , 1998 .

[5]  Thomas C. Rindflesch,et al.  MedPost: a part-of-speech tagger for bioMedical text , 2004, Bioinform..

[6]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[7]  Mark Liberman,et al.  A formal framework for linguistic annotation , 1999, Speech Commun..

[8]  David A. Ferrucci,et al.  UIMA: an architectural approach to unstructured information processing in the corporate research environment , 2004, Natural Language Engineering.

[9]  Stefan Evert,et al.  The NITE XML Toolkit: Flexible annotation for multimodal language data , 2003, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc.

[10]  Beatrice Alex,et al.  Optimising Selective Sampling for Bootstrapping Named Entity Recognition , 2005, ICML 2005.

[11]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition , 2002, CoNLL.

[12]  Malvina Nissim,et al.  Exploring the boundaries: gene and protein identification in biomedical text , 2005, BMC Bioinformatics.

[13]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[14]  Jean Carletta,et al.  The NITE XML Toolkit: Data Model and Query Language , 2005, Lang. Resour. Evaluation.

[15]  Malvina Nissim,et al.  A System for Identifying Named Entities in Biomedical Text: how Results From two Evaluations Reflect on Both the System and the Evaluations , 2005, Comparative and functional genomics.

[16]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[17]  Malvina Nissim,et al.  Recognising Geographical Entities in Scottish Historical Documents , 2003 .

[18]  Claire Grover,et al.  The HOLJ Corpus. Supporting Summarisation of Legal Texts , 2004 .