GOLD and Discourse: Domain- and Community-Specific Extensions

Corpora annotated for discourse-related phenomena have become an important resource for the empirical study of whole texts and a source of training data for the automated parsing of whole texts. By discourse-related phenomena, we mean the various textual relations (e.g., anaphoric and rhetorical relations) that hold among textual units (e.g., whole texts, adjacency pairs, and discourse segments). One of the main problems in corpus linguistics, and in the markup of linguistic data in general, has been the lack of interoperability between disparately marked-up resources. While it is theoretically possible to standardize the content and structure of markup, the richness of the data itself suggests that no single standard scheme would suffice for all data, all of the time. In fact, one of the major contributions of the E-MELD project has been to show that annotation elements need not be standardized as such. Rather, by following certain parameters of “best practice” (Bird & Simons 2003), the stage can be set for widespread interoperability among disparate corpora. For example, one of the key strategies of the E-MELD project has been to suggest that all markup elements used in annotating linguistic data (including discourse-related corpora) should be mapped to a semantic resource that defines the meaning of each element. Such a semantic resource for descriptive discourse categories does not exist at present. To rectify this situation, we propose a discourse-specific extension to the General Ontology for Linguistic Description (GOLD), as introduced by Farrar & Langendoen (2003) and explicated in Farrar (forthcoming).

[1] Graeme Hirst, et al. Anaphora in Natural Language Understanding: A Survey, 1981, Lecture Notes in Computer Science.

[2] Michael Strube, et al. Dialogue Acts, Synchronizing Units, and Anaphora Resolution, 2000, J. Semant.

[3] Martin van den Berg, et al. A Rule Based Approach to Discourse Parsing, 2004, SIGDIAL Workshop.

[4] Maki Watanabe, et al. Discourse Tagging Reference Manual, 2001.

[5] Gary Simons, et al. Seven Dimensions of Portability for Language Documentation and Description, 2002, ArXiv.

[6] William C. Mann, et al. Rhetorical Structure Theory: Toward a functional theory of text organization, 1988.

[7] Edith Bolling. Anaphora Resolution, 2006.

[8] Lauri Karttunen, et al. Discourse Referents, 1969, COLING.

[9] Scott Farrar. Using ‘Ontolinguistics’ for language description, 2006.

[10] Siegfried Handschuh, et al. Ontology-based Linguistic Annotation, 2003, ACL.

[11] Nicholas Asher, et al. Reference to abstract objects in discourse, 1993, Studies in linguistics and philosophy.

[12] Gunter Senft. Classificatory particles in Kilivila, 1995.

[13] Laura Alonso Alemany, et al. A Framework for Feature based Description of Low level Discourse, 2004, ACL 2004.

[14] Richard Kittredge, et al. Towards Stratification of RST, 1993.

[15] Andrea C. Schalley, et al. Ontolinguistics: How Ontological Status Shapes the Linguistics Coding of Concepts, 2007.

[16] Eduard Hovy, et al. Parsimonious or Profligate: How Many and Which Discourse Structure Relations?, 1992.

[17] Herbert H. Clark, et al. Bridging, 1975, TINLAP.

[18] Nicola Guarino, et al. The WonderWeb Library of Foundational Ontologies Preliminary Report, 2002.

[19] Bonnie L. Webber, et al. Discourse Deixis: Reference to Discourse Segments, 1988, ACL.

[20] Matthew Stone, et al. Anaphora and Discourse Structure, 2001, CL.

[21] Geert-Jan M. Kruijff, et al. Discourse-level Annotation for Investigating Information Structure, 2004, ACL 2004.

[22] Andreas Witt, et al. Co-reference annotation and resources: A multilingual corpus of typologically diverse languages, 2002, LREC.

[23] Johanna D. Moore, et al. A Problem for RST: The Need for Multi-Level Discourse Analysis, 1992, CL.

[24] Gunter Senft, et al. Kilivila: The Language of the Trobriand Islanders, 1986.