The Sentence-Aligned European Patent Corpus

This paper describes the creation and content of the Sentence-Aligned European Patent Corpus. The corpus contains more than 130 million sentence pairs covering six European languages. With more than 76 million sentence pairs, the EN-DE subcorpus is, to our knowledge, the largest bilingual sentence-aligned corpus. Work has started on obtaining subcorpora of similar size for other language pairs. The sentence alignment error rate was very low, even in the absence of language-specific resources.

Wolfgang Täger
