Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus

After three years of work, the Dutch Parallel Corpus (DPC) project has come to an end. The finalized corpus is a ten-million-word, high-quality, sentence-aligned, bidirectional parallel corpus of Dutch, English and French, with Dutch as the central language. In this paper we present the corpus and try to formulate some basic data collection principles, based on the work carried out for the project. Building a corpus is a difficult and time-consuming task, especially when every text sample included has to be cleared of copyright. The DPC is balanced across five text types (literature, journalistic texts, instructive texts, administrative texts and texts treating external communication) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). All the text material was cleared of copyright. The data collection process required the involvement of different text providers, which resulted in the drawing up of four different licence agreements. Problems such as an unknown source language, copyright issues and changes to the corpus design are discussed in detail and illustrated with examples, so as to help future corpus compilers.
