XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation

This paper deals with the representation of multi-level linguistic annotations. It proposes a generic, XML-based stand-off architecture and presents an example instantiation. Application scenarios that profit from this architecture are sketched out.

In recent years, corpus linguistics has become increasingly important to a broad community, including people working in theoretical, applied and computational linguistics. To many of them, speech and text corpora represent a rich source of data and phenomena, forming the basis of their research. Such data is even more valuable if it is annotated with suitable information, allowing for fast and effective retrieval of relevant data. Whereas corpora of the first generation featured part-of-speech and syntactic annotations (e.g. Penn Treebank [MSM93], TIGER corpus [BDE04]), the focus has now shifted to properties beyond the (morpho-)syntactic level. Recent corpora are annotated with semantic information (PropBank [KP02], FrameNet [JPB03], SALSA [EKPP03]), pragmatic information (Penn Discourse TreeBank [MPJW04], RST Discourse Treebank [CMO03], Potsdam Commentary Corpus [Ste04]), and dialogue structure (Switchboard SWBD-DAMSL [JSB97]).

Annotations often have to be carried out manually: reliable (semi-)automatic tools exist only for the annotation of part of speech and syntax, and are restricted to well-researched languages like English or German. Moreover, hand-annotated training material is a prerequisite for the development of automatic tools. As a consequence, corpora and annotations ought to be reusable so that a large community can profit from the data. To this end, various standardization efforts have been launched. Standardization of linguistic data concerns (see, e.g., [Sch05]):

(i) The physical data structure: here, XML has become the widely recognized standard format.

(ii) The logical data structure, i.e., the data models that are used to model the phenomena and their properties (e.g. hierarchical structures like trees or graphs for syntax annotations vs. time-aligned tiers for speech and dialogue annotations). Examples of data models are annotation graphs [BL01] and the NITE Object Model [CKO03b].

(iii) Content: in several initiatives, XML applications for specific linguistic annotations have been developed. For instance, TEI² (“Text Encoding Initiative”, [SB94]) defines highly detailed DTDs for encoding all kinds of bibliographic and other information; XCES³ (“XML-based Corpus Encoding Standard”) provides DTDs for the annotation of chunks, alignment, etc.

More recently, however, it has been recognized that these standardized DTDs often do not meet application-specific needs. Hence, abstract, generic XML formats have been proposed that allow for the formal integration of application-specific annotations [IR01]. For the conceptual integration of specific annotations, so-called data category repositories as well as linguistic ontologies have been developed. They define reference categories, with precise semantics and examples, that specific annotation tags ought to be mapped to (see, e.g., DOLCE⁴, “Descriptive Ontology for Linguistic and Cognitive Engineering”).

This paper deals with the formal integration of specific annotations. It first addresses the subject of stand-off architecture (sec. 1). We then propose an XML-based representation of linguistic annotation and present an example application (instantiation) in some detail (sec. 2). We also sketch out some application scenarios that profit from such a flexible architecture (sec. 3) and address related approaches (sec. 4).

¹ The research reported in this paper was jointly financed by the German Research Foundation (DFG, SFB632) and the Federal Ministry of Education and Research (BMBF grant no. 03WKH22). Many thanks go to my colleagues, especially Michael Götze, for helpful discussions of the topics addressed in this paper.

1 Stand-off Architecture

The topic of “stand-off annotation” has been discussed since as early as the mid-nineties (see, e.g., [TM97]). The term describes the situation where primary data (e.g., the source text) and annotations of this data are stored in separate files. Stand-off annotation might seem problematic because there is no immediate connection between the text and its annotation; hence, whenever the source text is modified, extra care has to be taken to synchronize its annotation. Similarly, human inspection of the data becomes cumbersome.

On the other hand, stand-off annotation has the great advantage of leaving the source text untouched. It thus allows for annotating text that cannot be modified for whatever reason, e.g., because it is a text available on the Internet. Moreover, whereas XML as such does not easily account for overlapping segments and conflicting hierarchies,⁵ these can be marked in a natural way in stand-off annotation by distributing annotations over different files. That is, not only is the source text separated from its annotations, but individual annotations are separated from each other as well. This way, annotations at different levels can be created and modified independently of each other. Finally, even competing, alternative annotations can be represented, e.g. variants of part-of-speech annotations that are the output of different tools.

² http://www.tei-c.org/
³ http://www.cs.vassar.edu/XCES/
⁴ http://www.loa-cnr.it/DOLCE.html
⁵ Different methods have been proposed to accommodate conflicting markup into XML. We will come back to them below.

One of the first proposals for stand-off annotation of linguistic corpora is [DBD98]. An ISO working group is currently developing the stand-off based LAF⁶ (“Linguistic Annotation Framework” [IRdlC03]). Some recent corpora like the ANC (“American National Corpus” [RI04]) are encoded in a stand-off architecture.
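
The separation of source text and annotation layers described in this section can be illustrated by a minimal Python sketch. It is not part of the proposed format: the layer names, the labels and the use of plain character offsets are assumptions made for illustration only (the format itself relies on XLinks and XPointers). The sketch keeps the primary data untouched, stores each layer separately, and detects spans that cross each other, i.e. spans that could not be nested inside a single XML tree.

```python
from dataclasses import dataclass

# Primary data: never modified by any annotation layer.
SOURCE = "one two three four"


@dataclass(frozen=True)
class Span:
    """One annotation: a character range of SOURCE plus a label."""
    start: int
    end: int
    label: str


# Each entry stands in for a separate annotation file.
# All layer names and labels are hypothetical.
LAYERS = {
    "layer_a": [Span(0, 7, "unit_a1"), Span(8, 18, "unit_a2")],
    "layer_b": [Span(4, 13, "unit_b1")],
    # Competing part-of-speech output of two hypothetical tools.
    "pos_tool1": [Span(0, 3, "CARD")],
    "pos_tool2": [Span(0, 3, "PRON")],
}


def text_of(span: Span) -> str:
    """Look up the stretch of source text a span refers to."""
    return SOURCE[span.start:span.end]


def crosses(a: Span, b: Span) -> bool:
    """True if a and b overlap but neither contains the other."""
    return (a.start < b.start < a.end < b.end
            or b.start < a.start < b.end < a.end)


# Pairs of spans from the two layers that conflict hierarchically.
conflicts = [
    (a.label, b.label)
    for a in LAYERS["layer_a"]
    for b in LAYERS["layer_b"]
    if crosses(a, b)
]
print(conflicts)
```

In this toy setting both units of layer_a cross the single unit of layer_b, yet each layer is well-formed on its own, and the two competing part-of-speech layers coexist without either one overwriting the other.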
The approach presented in this paper also subscribes to the principles of stand-off annotation.

2 A Generic XML Format

Our format defines generic XML elements like <mark> (markable), <feat> (feature), and <struct> (structure), which indicate which data type the annotation conforms to. We assume that primary data is stored in a file that optionally specifies a header, followed by a tag <body>, which contains the source text. Annotations are stored in separate files; they may refer to the source text or to other annotations. These relations are encoded by means of XLinks and XPointers. We distinguish three different types of annotations: markables, structures, and features.

(i) Markables: <mark> tags specify text positions or spans of text (or spans of other markables) that can be annotated with linguistic information. For instance, <mark> tags might indicate tokens by specifying ranges of the source text, cf. fig. 1.

(ii) Structures: <struct> tags are special types of markables. Similar to <mark> tags, they specify objects that can then serve as anchors for annotations. Whereas <mark> tags define simple types of anchors (flat spans of text or markables), a <struct> tag represents a complex anchor involving relations between arbitrarily many markables (including <struct> elements). Relations (<rel>) can be further specified by a type attribute, e.g. as undirected or directed (= pointers). Put differently, a <structList> specifies a complete tree or graph, which consists of single tree fragments specified by the <struct> tags, cf. fig. 1.

(iii) Features: <feat> tags specify information annotated to markables or structures, which are referred to by xlink attributes. The type of information (e.g., “part of speech”) is encoded by a type attribute, cf. fig. 2. For instance, the information encoded by the first <feat> in fig. 2 can be paraphrased as follows: take the token that is defined by the <mark> tag with the ID attribute id="tok_1" and assign the part of speech “ART” (article) to that token.

We intend to adopt the idea of [CKO03a] by assuming that admissible feature values (such as “NN”, normal/common noun, or “NE”, named entity) may be complex types and are organized in a type hierarchy. For instance, “NN” and “NE” might be subtypes of the more general type “N”, noun. <feat> tags then point to some type in the hierarchy (which is stored separately), thus specifying the value of the annotated property, cf. fig. 3.⁷

⁶ ISO Technical Committee TC37/SC4, http://www.tc37sc4.org
⁷ Type hierarchies have to be defined by the user or they may be derived from annotation schemes that incorporate hierarchies, cf. the schemes used by the annotation tool MMAX. In case no hierarchy is defined, the features will be organized in a flat list. The stand-off architecture allows the user to experiment with different hierarchies.

Further examples of annotations are sketched out below. They illustrate that annotations may stem from different sources (see the attribute source) and encode various types of information.

Categorial annotation (anchored to constituents):

    <header sfb_id="rabin1.const_cat" type="categories" source="TIGERcorpus"/>
    <featList xml:base="rabin1.const.xml">
      <feat xlink:href="#syn_1" value="PN"/> <!-- proper noun -->
      <feat xlink:href="#syn_2" value="PP"/> <!-- prepositional phrase -->
      ...

Coreference annotation, marking coreferential expressions such as pronouns (referred to by xlink:href attributes) and their antecedents (identified by target attributes):

    <header sfb_id="rabin1.coref" type="coreference" source="MMAXcoref"/>
    <featList>
      <!-- tok_19: "sein" ('his'); syn_9: "Der Rabin-Attentäter Jigal Amir" ('the Rabin assassin Jigal Amir') -->
      <feat xlink:href="rabin1.tok.xml#tok_19" target="rabin1.const.xml#syn_9" value="identity"/>
      ...

Document structure: headers, paragraphs, lists, etc. (anchored to markables that refer to tokens):

    <header sfb_id="rabin1.div" type="divisions"/>
    <markList xmlns:xlink="http://www.w3.org/1999/xlink" xml:base="rabin1.tok.xml">
      <mark id="div_1" xlink:href="#xpointer(id('tok_1')/range-to(id('tok_390')))"/>
      <mark id="div_2" xlink:href="#xpointer(id('tok_1')/range-to(id('tok_89')))"/>
      ...

    <header sfb_id="rabin1.div_docstr" type="documentStructure"/>
    <featL
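
How markables, features and a type hierarchy interact can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation of the format: the token dictionary, the file name tok.xml, the IDs np_1 and name_1, the placement of the type attribute on <featList>, and the parent map encoding the hierarchy are all hypothetical. Only the element names, the xlink:href and value attributes, the range-to pointer syntax and the “NN”/“NE” below “N” example are taken from the description above.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical token layer: id -> word form, in text order.
TOKENS = {"tok_1": "Der", "tok_2": "Rabin-Attentäter",
          "tok_3": "Jigal", "tok_4": "Amir"}

# Markables spanning other markables (here: ranges of tokens).
MARK_XML = """
<markList xmlns:xlink="http://www.w3.org/1999/xlink" xml:base="tok.xml">
  <mark id="np_1" xlink:href="#xpointer(id('tok_1')/range-to(id('tok_4')))"/>
  <mark id="name_1" xlink:href="#xpointer(id('tok_3')/range-to(id('tok_4')))"/>
</markList>
"""

# Features attached to token markables (values made up for illustration).
FEAT_XML = """
<featList xmlns:xlink="http://www.w3.org/1999/xlink" xml:base="tok.xml" type="pos">
  <feat xlink:href="#tok_1" value="ART"/>
  <feat xlink:href="#tok_2" value="NN"/>
  <feat xlink:href="#tok_3" value="NE"/>
  <feat xlink:href="#tok_4" value="NE"/>
</featList>
"""

# Assumed encoding of the type hierarchy: subtype -> supertype.
PARENT_TYPE = {"NN": "N", "NE": "N"}

RANGE = re.compile(r"#xpointer\(id\('([^']+)'\)/range-to\(id\('([^']+)'\)\)\)$")


def resolve(href, order):
    """Return the ids an href points to: one id or an inclusive range."""
    match = RANGE.match(href)
    if match is None:
        return [href.lstrip("#")]
    first, last = order.index(match.group(1)), order.index(match.group(2))
    return order[first:last + 1]


def is_subtype(value, ancestor, parent):
    """Walk up the hierarchy from value and look for ancestor."""
    while value is not None:
        if value == ancestor:
            return True
        value = parent.get(value)
    return False


order = list(TOKENS)
marks = {m.get("id"): resolve(m.get(XLINK_HREF), order)
         for m in ET.fromstring(MARK_XML).iter("mark")}
pos = {f.get(XLINK_HREF).lstrip("#"): f.get("value")
       for f in ET.fromstring(FEAT_XML).iter("feat")}

# All tokens inside np_1 whose part of speech is some kind of noun.
nouns = [TOKENS[t] for t in marks["np_1"]
         if is_subtype(pos[t], "N", PARENT_TYPE)]
print(nouns)
```

Because the hierarchy is kept apart from the feature file, replacing PARENT_TYPE by a different mapping (or an empty one, which yields a flat list of values) changes the result of the query without touching any annotation.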

[1]  Manfred Pinkal,et al.  Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation , 2003, ACL.

[2]  Nancy Ide,et al.  International Standard for a Linguistic Annotation Framework , 2003, Natural Language Engineering.

[3]  Rashmi Prasad,et al.  The Penn Discourse Treebank , 2004, LREC.

[4]  Andreas Witt,et al.  Unification of XML Documents with Concurrent Markup , 2005, Lit. Linguistic Comput..

[5]  C. M. Sperberg-McQueen,et al.  Hierarchical encoding of text: Technical problems and SGML solutions , 1995, Comput. Humanit..

[6]  Jonathan G. Fiscus,et al.  A Practical Introduction to ATLAS , 2002, LREC.

[7]  Katrin Erk,et al.  A Powerful and Versatile XML Format for Representing Role-semantic Annotation , 2004, LREC.

[8]  Caren Brinckmann,et al.  Multi-dimensional annotation of linguistic corpora for investigating information structure , 2004, FCP@NAACL-HLT.

[9]  Jean Véronis,et al.  Text Encoding Initiative: Background and Contexts , 1995 .

[10]  Daniel Marcu,et al.  Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory , 2001, SIGDIAL Workshop.

[11]  Manfred Stede,et al.  ANNIS: A Linguistic Database for Exploring Information Structure , 2004 .

[12]  Josef Ruppenhofer,et al.  FrameNet: Theory and Practice , 2003 .

[13]  Martha Palmer,et al.  From TreeBank to PropBank , 2002, LREC.

[14]  C. M. Sperberg-McQueen,et al.  Guidelines for electronic text encoding and interchange : TEI P4 , 2002 .

[15]  Jonathan G. Fiscus,et al.  A Practical Introduction to ATLAS , 2002 .

[16]  G. Meade Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory , 2001 .

[17]  Jean Carletta,et al.  The NITE Object Model Library for Handling Structured Linguistic Annotation on Multimodal Data Sets , 2002 .

[18]  Manfred Stede,et al.  The Potsdam Commentary Corpus , 2004, ACL 2004.

[19]  C. M. Sperberg-McQueen,et al.  Guidelines for electronic text encoding and interchange , 1994 .

[20]  C. M. Sperberg-McQueen,et al.  Hierarchical Encoding of Text: Technical Problems and SGML Solutions , 1995 .

[21]  Mark Liberman,et al.  A formal framework for linguistic annotation , 1999, Speech Commun..

[22]  Nancy Ide,et al.  Overall Goals and the First Release , 2004 .

[23]  David McKelvie,et al.  Hyperlink semantics for standoff markup of read-only documents , 1997 .

[24]  Niels Ole Bernsen,et al.  The MATE Markup Framework , 2000, SIGDIAL Workshop.

[25]  Wolfgang Lezius,et al.  TIGER: Linguistic Interpretation of a German Corpus , 2004 .

[26]  Nancy Ide,et al.  Standards for Language Resources , 2002, LREC.

[27]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.