From D-Coi to SoNaR: a reference corpus for Dutch

The computational linguistics community in the Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. The STEVIN programme was established in part to answer this need. To pave the way for building a 500-million-word reference corpus of written Dutch, a pilot project was set up. This pilot, the Dutch Corpus Initiative (D-Coi), was highly successful: it not only realized about 10% of the projected large reference corpus, but also established best practices and developed all the protocols and tools needed to build the larger corpus within a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once the corpus had been converted to a suitable XML format, further linguistic annotation with the state-of-the-art tools developed by the consortium partners, either before or during the pilot, proved straightforward and fruitful. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, the latter involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, the project in which the 500-million-word reference corpus of Dutch is to be built.
