Paper information - Extended Coreferential Relations and Bridging Anaphora in the Prague Dependency Treebank

The present paper outlines the coding scheme for annotating extended nominal coreference and bridging relations in the Prague Dependency Treebank. We compare our annotation scheme with existing ones, taking into account the language to which each scheme is applied. We identify the annotation principles and demonstrate their application to the large-scale annotation of Czech texts. We further present our classification of the types of coreferential and bridging relations and discuss some problematic aspects in this area. We also present and discuss the automatic pre-annotation and some helpful features of the annotation tool, such as maintaining the coreferential chain and underlining candidate antecedents. A statistical evaluation is performed on the already annotated part of the Prague Dependency Treebank. We also present the first results of the inter-annotator agreement measurement and explain the most frequent cases of disagreement.

1 Introduction

The Prague Dependency Treebank (henceforth PDT) is a large collection of linguistically annotated data and documentation [2]. In PDT 2.0, Czech newspaper texts are annotated using a three-layer annotation scenario. The most abstract (tectogrammatical) layer includes, among other markup, the annotation of coreferential links. The whole PDT 2.0 corpus contains almost 50 thousand sentences.

In PDT 2.0, two types of coreference are annotated (mainly manually): grammatical and textual coreference. Grammatical coreference typically occurs within a single sentence, and the antecedent can be derived on the basis of the grammatical rules of the given language. It includes the coreference of relative pronouns, arguments of verbs of control, reflexive pronouns, reciprocity and verbal complements.

Textual coreference, which is realized not by grammatical means alone but also on the basis of the context, has so far been restricted to cases in which a demonstrative this or an anaphoric third-person pronoun, including its zero form, is used [8].
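The distinction between grammatical and textual coreference, and the idea of a coreferential chain, can be illustrated with a minimal Python sketch. This is not the PDT data format or the annotation tool's implementation: the `CorefLink` class, the `build_chains` function, and the node identifiers such as `s1.book` are all hypothetical names chosen for illustration. The sketch assumes only that each link connects an anaphor to an antecedent and carries one of the two types named above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorefLink:
    """One coreferential link (hypothetical structure, not the PDT format)."""
    anaphor: str      # illustrative node identifier
    antecedent: str   # illustrative node identifier
    kind: str         # "grammatical" or "textual"


def build_chains(links):
    """Group all nodes connected by links into coreferential chains."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for link in links:
        root_a, root_b = find(link.anaphor), find(link.antecedent)
        if root_a != root_b:
            parent[root_a] = root_b

    chains = defaultdict(set)
    for node in list(parent):
        chains[find(node)].add(node)
    # Sort for a deterministic result.
    return sorted(sorted(chain) for chain in chains.values())


# Invented example: a relative pronoun linked within one sentence
# (grammatical) and pronouns linked across sentences (textual).
links = [
    CorefLink("s1.which", "s1.book", "grammatical"),
    CorefLink("s2.it", "s1.book", "textual"),
    CorefLink("s3.he", "s2.author", "textual"),
]
chains = build_chains(links)
```

Here the first two links end up in one chain because they share the antecedent `s1.book`, regardless of link type; the third link forms a separate chain.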

[1] Constantin Orasan, et al. PALinkA: A highly customisable tool for discourse annotation, 2003, SIGDIAL Workshop.

[2] Philip N. Johnson-Laird, et al. Thinking; Readings in Cognitive Science, 1977.

[3] Michael Strube, et al. Multi-Level Annotation in MMAX, 2003, SIGDIAL Workshop.

[4] Eva Hajicová, et al. From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank, 2008, LREC.

[5] L. Ku, et al. Coreferential Relations In The Prague Dependency Treebank, 2005.

[6] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[7] Mariona Taulé, et al. Text as Scene: Discourse Deixis and Bridging Relations, 2007, Proces. del Leng. Natural.

[8] Ron Artstein, et al. Anaphoric Annotation in the ARRAU Corpus, 2008, LREC.

[9] Mark A. Przybocki, et al. The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation, 2004, LREC.

[10] Rebecca J. Passonneau. Computing Reliability for Coreference Annotation, 2004, LREC.

[11] Petr Pajas, et al. Recent Advances in a Feature-Rich Framework for Treebank Annotation, 2008, COLING.