Dutch Parallel Corpus: a multilingual annotated corpus

Aligned parallel corpora are an indispensable resource for a wide range of multilingual applications, including machine translation (MT), especially corpus-based approaches such as statistical MT (Koehn, 2005) and example-based MT (Carl and Way, 2003), computer-assisted translation tools (Hutchins, 2005), multilingual information extraction, and computer-assisted language learning (Desmet and Paulussen, 2005). Beyond these technological applications, parallel corpora also support more fundamental research in contrastive linguistics and translation studies (Baker, 1995; Laviosa, 2002; Olohan, 2004). Because high-quality parallel corpora with Dutch as a central language either do not exist or are inaccessible to the research community due to copyright restrictions, compiling such corpora was one of the priorities of the STEVIN program (Odijk et al., 2004). The Dutch Parallel Corpus (DPC) project aims to fulfil this need. Within the DPC project, a 10-million-word, high-quality, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French is being compiled. The corpus will be enriched with linguistic annotations: part-of-speech tags and lemmas for the whole corpus, and syntactic analysis for a subpart of it. Because the corpus is bidirectional (Dutch serves as both source and target language), it can also be used as a comparable corpus, for example to compare texts originally written in Dutch with texts translated into Dutch. Part of the corpus is trilingual and contains Dutch texts translated into both English and French. To guarantee the quality of the corpus and its multifunctional use by the wider research community, each step in compiling, structuring and annotating the corpus is validated by a user group of specialists in linguistics and language technology.
To make the corpus accessible to the whole research community, copyright clearance is being obtained for all samples included in the corpus. The DPC project started in May 2006 and runs until March 2009. The remainder of the paper is organized as follows. Section 2 deals with the specific needs of the different intended users. Section 3 describes the corpus design in detail. Section 4 focuses on the more technical issues of text normalization, alignment and linguistic annotation. Section 5 elaborates on quality control, and Section 6 concludes the paper.

[1] Sara Laviosa, et al. Corpus-based Translation Studies: Theory, Findings, Applications, 2002.

[2] Sara Laviosa. Corpus-based translation studies, 2002.

[3] David Y. W. Lee, et al. Genres, Registers, Text Types, Domains and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, 2001.

[4] Mona Baker. Corpora in Translation Studies: An Overview and Some Suggestions for Future Research, 1995.

[5] Robert C. Moore. Fast and accurate sentence alignment of bilingual corpora, 2002, AMTA.

[6] Andy Way, et al. Recent Advances in Example-Based Machine Translation, 2004.

[7] Mona Baker, et al. Corpus-based Translation Studies: The Challenges that Lie Ahead, 1996.

[8] Friedrich Ungerer, et al. An introduction to cognitive linguistics, 1999.

[9] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[10] Maeve Olohan, et al. Introducing Corpora in Translation Studies, 2004.

[11] Lieve Macken. Analysis of translational correspondence in view of sub-sentential alignment, 2007.

[12] Bruno Pouliquen, et al. Massive multi lingual corpus compilation: Acquis Communautaire and totale, 2005.

[13] Hans Paulussen, et al. CorpusCALL: opportunities and challenges, 2005.

[14] Michel Simard, et al. Studying the Human Translation Process through the TransSearch Log-Files, 2005, AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.

[15] Guy Deville, et al. Génération de corpus multilingues dans la mise en oeuvre d'un outil en ligne d'aide à la lecture de textes en langue étrangère, 2004.

[16] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[17] Antal van den Bosch, et al. Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development, 2006, LREC.

[18] Dan Gervais. MultiTrans system presentation: translation support and language management solutions, 2003, MTSUMMIT.

[19] M. Barlow. ParaConc: Concordance Software for Multilingual Parallel Corpora, 2002.

[20] Valentin Shevchuk, et al. Corpus-based translation studies: theory, findings, applications, 2009.