Paper information - Europarl: A Parallel Corpus for Statistical Machine Translation

We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.

Philipp Koehn

[1] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[2] Vincent J. Della Pietra, et al. The Mathematics of Statistical Machine Translation, 1993.

[3] Marti A. Hearst, et al. Adaptive Multilingual Sentence Boundary Disambiguation, 1997, CL.

[4] Adwait Ratnaparkhi, et al. A Maximum Entropy Approach to Identifying Sentence Boundaries, 1997, ANLP.

[5] I. Dan Melamed, et al. Bitext Maps and Alignment via Pattern Recognition, 1999, CL.

[6] Craig A. Knoblock, et al. A Hierarchical Approach to Wrapper Induction, 1999, AGENTS '99.

[7] Philip Resnik, et al. Mining the Web for Bilingual Text, 1999, ACL.

[8] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[9] Daniel Marcu, et al. Statistical Phrase-Based Translation, 2003, NAACL.