Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been copyright-cleared. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by its balanced composition and, thanks to its copyright clearance, by its availability to the wider research community. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata makes the corpus easier to navigate and enables users to select the texts that meet their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).
