Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation

We propose a new method for inducing a phrase-based translation model from a pair of unrelated monolingual corpora. The method handles phrases of arbitrary length and finds phrase pairs that are useful for statistical machine translation, without requiring large parallel or comparable corpora. First, it generates phrase pairs by coupling source and target phrases collected separately from their respective monolingual data. Then, for each phrase pair, we compute features using the monolingual data and a small quantity of parallel sentences. Finally, incorrect phrase pairs are pruned, and a phrase table is built from the remaining pairs. In experiments on French–Japanese and Spanish–Japanese translation tasks under low-resource conditions, we observe that incorporating a phrase table induced by our method into the machine translation system leads to large improvements in translation quality. Furthermore, we show that a phrase table induced by our method is also useful in a wide range of configurations, including those where large parallel corpora are already available and those where only small monolingual corpora are available.
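The three steps described above (generate, score, prune) can be illustrated with a minimal toy sketch. This is not the paper's implementation: the n-gram phrase collection, the single `lexicon_score` feature, the `seed_lexicon` (standing in for knowledge drawn from the small set of parallel sentences), and the pruning threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def collect_phrases(sentences, max_len=2, min_count=1):
    """Collect word n-grams (length 1..max_len) seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {phrase for phrase, c in counts.items() if c >= min_count}


def lexicon_score(src_phrase, tgt_phrase, seed_lexicon):
    """Hypothetical feature: share of words on both sides that are linked
    to a word on the other side by the seed lexicon."""
    src = src_phrase.split()
    tgt = tgt_phrase.split()
    src_hits = sum(1 for s in src if seed_lexicon.get(s, set()) & set(tgt))
    tgt_hits = sum(
        1 for t in tgt if any(t in seed_lexicon.get(s, set()) for s in src)
    )
    return (src_hits + tgt_hits) / (len(src) + len(tgt))


def induce_phrase_table(src_sents, tgt_sents, seed_lexicon,
                        threshold=1.0, max_len=2):
    """Step 1: couple phrases; step 2: score each pair; step 3: prune."""
    src_phrases = collect_phrases(src_sents, max_len)
    tgt_phrases = collect_phrases(tgt_sents, max_len)
    table = {}
    for s, t in product(src_phrases, tgt_phrases):
        score = lexicon_score(s, t, seed_lexicon)
        if score >= threshold:  # pairs below the threshold are pruned
            table[(s, t)] = score
    return table


# Toy data: the two "corpora" are unrelated sentence lists, not aligned text.
src_corpus = ["le chat noir dort"]
tgt_corpus = ["the black cat sleeps"]
seed = {"le": {"the"}, "chat": {"cat"}, "noir": {"black"}, "dort": {"sleeps"}}

table = induce_phrase_table(src_corpus, tgt_corpus, seed)
for (s, t), score in sorted(table.items()):
    print(f"{s} ||| {t} ||| {score:.2f}")
```

The exhaustive Cartesian coupling is only workable on toy data. On real corpora the candidate set would have to be restricted before scoring, and a single lexicon-overlap feature would be replaced by the richer feature set the abstract alludes to.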
