Interacting Semantic Layers of Annotation in SoNaR, a Reference Corpus of Contemporary Written Dutch

This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers: named entities, coreference relations, semantic roles, and spatial and temporal expressions. These layers build on the manually verified part-of-speech tagging, lemmatization and syntactic analysis (dependency tree) layers that resulted from an earlier project (Van Noord et al., 2006), so the outcome is a corpus with deep syntactic and semantic annotation. The annotation effort is carried out within a larger project that aims to collect a 500-million-word corpus of contemporary Dutch, covering the variants used in the Netherlands and in Flanders, the Dutch-speaking part of Belgium. All annotation schemes used were (co-)developed by the authors within the Flemish-Dutch STEVIN programme, as no previous schemes for Dutch were available. They were created taking into account standards used elsewhere, whether de facto or official (such as ISO).

[1] Isabelle Delaere, et al. Cultivating trees: adding several semantic layers to the Lassy treebank in SoNaR, 2008.

[2] Eva Hajicová, et al. From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank, 2008, LREC.

[3] Malvina Nissim, et al. Towards a Corpus Annotated for Metonymies: the Case of Location Names, 2002, LREC.

[4] Walter Daelemans, et al. Learning Dutch Coreference Resolution, 2005, CLIN.

[5] Franciska de Jong, et al. Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus, 2010, LREC.

[6] Nancy Chinchor, et al. Appendix E: MUC-7 Named Entity Task Definition (version 3.5), 1998, MUC.

[7] Véronique Hoste, et al. Towards a Balanced Named Entity Corpus for Dutch, 2010, LREC.

[8] Ineke Schuurman, et al. Spatiotemporal Annotation on Top of an Existing Treebank, 2007.

[9] Estela Saquete Boró, et al. Using Semantic Networks to Identify Temporal Expressions from Semantic Roles, 2009, RANLP.

[10] Ineke Schuurman, et al. Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications, 2010, LREC.

[11] Ineke Schuurman. Which New York, which Monday? The role of background knowledge and intended audience in automatic disambiguation of spatiotemporal expressions, 2007, CLIN 2007.

[12] Lynette Hirschman, et al. A Model-Theoretic Coreference Scoring Scheme, 1995, MUC.

[13] Paola Monachesi, et al. Adding Semantic Role Annotation to a Corpus of Written Dutch, 2007, LAW@ACL.

[14] Jean Carletta, et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[15] G. Lakoff, et al. Metaphors We Live By, 1982.

[16] Erik F. Tjong Kim Sang, et al. Memory-Based Named Entity Recognition, 2002, CoNLL.

[17] Ineke Schuurman. Spatiotemporal Annotation Using MiniSTEx: how to deal with Alternative, Foreign, Vague and/or Obsolete Names?, 2008, LREC.

[18] Sven Hartrumpf, et al. On metonymy recognition for geographic information retrieval, 2008, Int. J. Geogr. Inf. Sci.

[19] Véronique Hoste, et al. Optimization issues in machine learning of coreference resolution, 2005.

[20] Gertjan van Noord, et al. Syntactic Annotation of Large Corpora in STEVIN, 2006, LREC.

[21] Mark A. Przybocki, et al. The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation, 2004, LREC.