Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual Data Using Word Embeddings

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first uses word embeddings to efficiently evaluate trillions of candidate sentence pairs, and then applies a classifier to select the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
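The two-stage idea in the abstract (cheap embedding-based scoring of candidate pairs, followed by a stricter filter) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy bilingual embedding table, the use of averaged word vectors as sentence representations, cosine similarity as the score, and a fixed similarity threshold standing in for the classifier. The function names `sentence_vector`, `cosine`, and `extract_pairs` are hypothetical.

```python
import math

# Toy embeddings in a shared bilingual space (assumed values, for illustration only).
EMBEDDINGS = {
    "cat": [1.0, 0.0], "chat": [0.9, 0.1],
    "dog": [0.0, 1.0], "chien": [0.1, 0.9],
}

def sentence_vector(tokens, embeddings):
    """Average the vectors of the known tokens; return None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def extract_pairs(src_sentences, tgt_sentences, embeddings, threshold=0.9):
    """Stage 1: pick the most similar target sentence for each source sentence.
    Stage 2: keep the pair only if its score passes a threshold.
    The threshold is a stand-in for a trained classifier (an assumption)."""
    tgt_vectors = [(t, sentence_vector(t.split(), embeddings)) for t in tgt_sentences]
    pairs = []
    for src in src_sentences:
        src_vec = sentence_vector(src.split(), embeddings)
        if src_vec is None:
            continue
        scored = [(cosine(src_vec, vec), tgt) for tgt, vec in tgt_vectors if vec is not None]
        if not scored:
            continue
        best_score, best_tgt = max(scored, key=lambda item: item[0])
        if best_score >= threshold:
            pairs.append((src, best_tgt, best_score))
    return pairs

pairs = extract_pairs(["cat", "dog"], ["chien", "chat"], EMBEDDINGS)
```

This brute-force loop compares every source sentence with every target sentence, which would not scale to the trillions of candidate pairs mentioned in the abstract; a realistic setup would need a far more efficient search, which this sketch does not attempt.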

Atsushi Fujita | Benjamin Marie
