Paper information - Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts such as SMS and chat, the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and a smaller, more privacy-compliant SoNaR of the commissioned size, IPR-cleared for commercial purposes as well.

Franciska de Jong | Orphée De Clercq | Nelleke Oostdijk | Henk van den Heuvel | Martin Reynaert
