Paper information - H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings

This paper presents our solution for the BUCC 2018 Shared Task on parallel sentence extraction from comparable corpora. Our system identifies parallel sentence pairs in French-English corpora by following a hybrid approach pairing multilingual sentence-level embeddings, neural machine translation, and supervised classification. Our system consists of a two-step process. In the first step, to reduce the size and the noise of the candidate sentence pairs, we filter the target translation candidates using the continuous vector representation of each source-target sentence pair learned using a bilingual distributed representation model. Then we select the best translation using a neural machine translation system or a binary classification model. We achieve an F1-score of up to 75.2 and 76.0 on the BUCC18 train and test sets respectively.
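The two-step process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence vectors are hand-written toy values rather than output of a bilingual representation model, and `toy_score` is a hypothetical lexicon-overlap stand-in for the neural machine translation system or binary classifier used in the second step. The function names, `top_k`, and `threshold` are likewise assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(src_vec: Vector, tgt_vecs: Dict[str, Vector], top_k: int) -> List[str]:
    """Step 1: keep the top_k target sentences whose vectors are closest to the source."""
    ranked = sorted(tgt_vecs, key=lambda t: cosine(src_vec, tgt_vecs[t]), reverse=True)
    return ranked[:top_k]


def select_best(
    src: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Step 2: return the highest-scoring candidate, or None if it scores below threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda t: score(src, t))
    return best if score(src, best) >= threshold else None


# Hypothetical stand-in for the NMT-based or classifier-based scorer.
TOY_LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}


def toy_score(src: str, tgt: str) -> float:
    """Jaccard overlap between the word-by-word 'translated' source and the target."""
    translated = {TOY_LEXICON.get(w, w) for w in src.lower().split()}
    tgt_words = set(tgt.lower().split())
    if not translated or not tgt_words:
        return 0.0
    return len(translated & tgt_words) / len(translated | tgt_words)


if __name__ == "__main__":
    source = "le chat dort"
    source_vec = [0.9, 0.1, 0.0]  # toy embedding, assumed for illustration
    targets = {
        "the cat sleeps": [0.8, 0.2, 0.1],
        "the dog sleeps": [0.7, 0.3, 0.0],
        "stock prices fell": [0.0, 0.1, 0.9],
    }
    shortlist = filter_candidates(source_vec, targets, top_k=2)
    print(shortlist)
    print(select_best(source, shortlist, toy_score, threshold=0.6))
```

Filtering first keeps the more expensive second-step scorer from running on every source-target combination, which is the motivation the abstract gives for reducing the size and noise of the candidate set.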

Houda Bouamor | Hassan Sajjad
