H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings

This paper presents our solution for the BUCC 2018 Shared Task on parallel sentence extraction from comparable corpora. Our system identifies parallel sentence pairs in French-English corpora with a hybrid approach that combines multilingual sentence-level embeddings, neural machine translation, and supervised classification. The system works in two steps. In the first step, to reduce the size and noise of the candidate sentence pairs, we filter the target translation candidates using continuous vector representations of each source-target sentence pair, learned with a bilingual distributed representation model. In the second step, we select the best translation among the remaining candidates using either a neural machine translation system or a binary classification model. We achieve F1-scores of up to 75.2 and 76.0 on the BUCC18 train and test sets, respectively.
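The two-step process described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the function names (`filter_candidates`, `extract_pairs`), the `rescore` callback, and the thresholds `top_k`, `min_sim`, and `accept` are all assumptions made for illustration. Sentence embeddings are assumed to be precomputed by some bilingual model, cosine similarity is assumed as the filtering measure, and `rescore` stands in for whichever second-step scorer is used (a neural machine translation system or a binary classifier).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(
    src_vec: Vector,
    tgt_vecs: Sequence[Vector],
    top_k: int = 2,        # hypothetical cut-off, not taken from the paper
    min_sim: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[int, float]]:
    """Step 1: keep at most top_k target indices whose similarity >= min_sim."""
    scored = [(j, cosine(src_vec, t)) for j, t in enumerate(tgt_vecs)]
    kept = [(j, s) for j, s in scored if s >= min_sim]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


def extract_pairs(
    src_vecs: Sequence[Vector],
    tgt_vecs: Sequence[Vector],
    rescore: Callable[[int, int], float],  # placeholder for an NMT or classifier score
    top_k: int = 2,
    min_sim: float = 0.5,
    accept: float = 0.5,   # hypothetical acceptance threshold
) -> List[Tuple[int, int]]:
    """Step 2: among the filtered candidates, keep the best-scoring target."""
    pairs = []
    for i, src_vec in enumerate(src_vecs):
        candidates = filter_candidates(src_vec, tgt_vecs, top_k, min_sim)
        if not candidates:
            continue
        best_j, best_score = max(
            ((j, rescore(i, j)) for j, _ in candidates),
            key=lambda pair: pair[1],
        )
        if best_score >= accept:
            pairs.append((i, best_j))
    return pairs
```

Here the first step shrinks the search space cheaply, so the more expensive second-step scorer only runs on a handful of candidates per source sentence. A source sentence with no candidate above the acceptance threshold yields no pair, which matters in comparable corpora where most sentences have no translation.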
