The United Nations Parallel Corpus v1.0

This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The corpus consists of manually translated UN documents from the 25 years between 1990 and 2014 in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. It is freely available for download under a liberal license. In addition to the pairwise aligned documents, a fully aligned subcorpus covering all six official languages is distributed. We provide baseline BLEU scores for our Moses-based SMT systems, both for systems trained on the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus.
