Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment

The quality of a statistical machine translation (SMT) system is heavily dependent upon the amount of parallel sentences used in training. In recent years, there have been several approaches developed for obtaining parallel sentences from non-parallel, or comparable data, such as news articles published within the same time period (Munteanu and Marcu, 2005), or web pages with a similar structure (Resnik and Smith, 2003). One resource not yet thoroughly explored is Wikipedia, an online encyclopedia containing linked articles in many languages. We advance the state of the art in parallel sentence extraction by modeling the document level alignment, motivated by the observation that parallel sentence pairs are often found in close proximity. We also include features which make use of the additional annotation given by Wikipedia, and features using an automatically induced lexicon model. Results for both accuracy in sentence extraction and downstream improvement in an SMT system are presented.

[1]  Philipp Koehn,et al.  Learning a Translation Lexicon from Monolingual Corpora , 2002, ACL 2002.

[2]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[3]  Kenneth Ward Church,et al.  Identifying Word Correspondences in Parallel Texts , 1991, HLT.

[4]  Roy Bar-Haim,et al.  The Second PASCAL Recognising Textual Entailment Challenge , 2006 .

[5]  Noah A. Smith,et al.  The Web as a Parallel Corpus , 2003, CL.

[6]  Reinhard Rapp,et al.  Automatic Identification of Word Translations from Unrelated English and German Corpora , 1999, ACL.

[7]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[8]  Pascale Fung,et al.  Multi-level Bootstrapping For Extracting Parallel Sentences From a Quasi-Comparable Corpus , 2004, COLING.

[9]  Phil Blunsom,et al.  Discriminative Word Alignment with Conditional Random Fields , 2006, ACL.

[10]  Kenneth Ward Church,et al.  Identifying word correspondence in parallel texts , 1991 .

[11]  Christoph Tillmann,et al.  A Simple Sentence-Level Extraction Algorithm for Comparable Data , 2009, NAACL.

[12]  Stephan Vogel,et al.  Adaptive parallel sentences mining from web bilingual news collection , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[13]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[14]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[15]  Robert C. Moore Fast and accurate sentence alignment of bilingual corpora , 2002, AMTA.

[16]  Christoph Tillmann,et al.  A Beam-Search Extraction Algorithm for Comparable Data , 2009, ACL.

[17]  Dan Klein,et al.  Learning Bilingual Lexicons from Monolingual Corpora , 2008, ACL.

[18]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[19]  Maarten de Rijke,et al.  Finding Similar Sentences across Multiple Languages in Wikipedia , 2006 .

[20]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[21]  Weblog Wikipedia,et al.  In Wikipedia the Free Encyclopedia , 2005 .

[22]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[23]  Peter Clark,et al.  The Seventh PASCAL Recognizing Textual Entailment Challenge , 2011, TAC.

[24]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.