The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch

The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme. Such a corpus, sampling texts from conventional and new media, is invaluable for scientific research and application development. This chapter describes how the Dutch reference corpus was developed in two consecutive STEVIN-funded projects, D-Coi and SoNaR. The construction of the corpus was guided by national and international standards and best practices. At the same time, the achievements and experience gained in the D-Coi and SoNaR projects contributed to the further advancement and dissemination of those standards and practices.
