Introducing the Prague Discourse Treebank 1.0

We present the Prague Discourse Treebank 1.0, a collection of Czech texts annotated for various discourse-related phenomena "beyond the sentence boundary". The treebank contains manual annotations of (1) discourse connectives, their arguments and senses, (2) textual coreference, and (3) bridging anaphora, all carried out on 50k sentences of the treebank. Unlike most similar projects, the annotation was performed directly on top of syntactic trees (from the earlier Prague Dependency Treebank 2.5 project), thus benefiting from the linguistic information already available for the same data. In this article, we present our theoretical background, describe the annotations in detail, and offer evaluation numbers and corpus statistics.
