Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling

It is well known that the output quality of statistical machine translation (SMT) systems increases with more training data. To obtain more parallel text for translation modeling, researchers have turned to the web to mine parallel sentences, but most previous approaches have avoided the difficult problem of pairwise similarity on cross-lingual documents and have instead relied on heuristics. In contrast, we confront this challenge head-on using the MapReduce framework. On a modest cluster, our scalable end-to-end processing pipeline automatically gathered 5.8 million parallel sentence pairs from English and German Wikipedia. Augmenting existing bitext with these data yielded significant improvements over a state-of-the-art baseline (2.39 BLEU points in the best case).
