Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual Data Using Word Embeddings

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora without relying on any document-level information. Our method first uses word embeddings to efficiently evaluate trillions of candidate sentence pairs, and then applies a classifier to select the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when the translation model is trained on sentence pairs extracted from in-domain monolingual corpora.
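The two-stage idea described above (embedding-based scoring of candidate pairs, followed by a filtering step) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the word embeddings of both languages are assumed to already live in a shared space, a sentence is represented as the average of its word vectors, candidates are ranked by cosine similarity, and a plain score threshold stands in for the trained classifier. The function names `sentence_vector`, `extract_candidates`, and `filter_pairs` are hypothetical.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the embeddings of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_candidates(src_sents, tgt_sents, src_emb, tgt_emb, top_k=1):
    """Stage 1 (assumed): keep the top_k most similar targets per source sentence."""
    tgt_vecs = [sentence_vector(s, tgt_emb) for s in tgt_sents]
    candidates = []
    for i, src in enumerate(src_sents):
        src_vec = sentence_vector(src, src_emb)
        if src_vec is None:
            continue  # no known words, nothing to compare
        scored = [
            (i, j, cosine(src_vec, tgt_vec))
            for j, tgt_vec in enumerate(tgt_vecs)
            if tgt_vec is not None
        ]
        scored.sort(key=lambda item: item[2], reverse=True)
        candidates.extend(scored[:top_k])
    return candidates


def filter_pairs(candidates, threshold=0.9):
    """Stage 2 stand-in: a score threshold in place of a trained classifier."""
    return [c for c in candidates if c[2] >= threshold]


# Toy data: hypothetical 2-dimensional embeddings in a shared space.
src_emb = {"chat": [1.0, 0.0], "chien": [0.0, 1.0]}
tgt_emb = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
src_sents = [["chat"], ["chien"]]
tgt_sents = [["dog"], ["cat"]]

pairs = extract_candidates(src_sents, tgt_sents, src_emb, tgt_emb)
reliable = filter_pairs(pairs, threshold=0.9)
```

This brute-force loop compares every source sentence with every target sentence, which would not scale to trillions of pairs; it only shows the shape of the pipeline. An efficient variant would need some form of approximate or indexed nearest-neighbour search, and the second stage would use a classifier trained on features of each candidate pair rather than a fixed threshold.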
