It is well known that standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Despite the fact that increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagset (DTD) for annotated biomedical corpora and other resources, which is based on the Text Encoding Initiative Guidelines P4, a general and parameterisable standard for encoding language resources. We ground the discussion in the encoding of the GENIA corpus, which currently contains 2,000 abstracts taken from the MEDLINE database, and has almost 100,000 hand-annotated terms marked for semantic class from the accompanying ontology. The paper introduces GENIA and TEI and implements a TEI parametrisation and conversion for the GENIA corpus. A number of aspects of biomedical language are discussed, such as complex tokenisation, prevalence of contractions and complex terms, and the linkage and encoding of ontologies.
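As a rough illustration of the kind of conversion the abstract describes, the sketch below rewrites a GENIA-style annotated sentence into TEI-style markup. It is a minimal sketch under stated assumptions, not the paper's actual DTD or conversion: the sample sentence is invented, the function name `genia_to_tei` is hypothetical, and the mapping (`sentence` to `s`, `cons` to `term`, the `sem` value to a `type` attribute) is only one plausible way of expressing semantic classes with TEI elements.

```python
import xml.etree.ElementTree as ET

# Invented GENIA-style sample: <cons> marks a term, and sem holds its
# semantic class from the ontology (prefixed with "G#").
GENIA_SNIPPET = (
    '<sentence>'
    '<cons sem="G#protein_molecule">IL-2</cons> activates '
    '<cons sem="G#cell_type">T cells</cons>.'
    '</sentence>'
)


def genia_to_tei(xml_text):
    """Rename GENIA-style elements to TEI-style ones (assumed mapping)."""
    root = ET.fromstring(xml_text)
    root.tag = "s"  # assumption: sentence -> TEI <s>
    # Collect the elements first so that renaming does not interfere
    # with the traversal.
    for cons in list(root.iter("cons")):
        cons.tag = "term"  # assumption: cons -> TEI <term>
        sem = cons.attrib.pop("sem", None)
        if sem is not None:
            # Drop the "G#" ontology prefix and keep the class name.
            cons.set("type", sem.replace("G#", "", 1))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(genia_to_tei(GENIA_SNIPPET))
```

A real conversion would also have to handle nested and coordinated terms, tokenisation, and the link from each class name to its ontology entry, which are the harder aspects the paper discusses.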