Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions

Following the WMT 2018 Shared Task on Parallel Corpus Filtering (Koehn et al., 2018), we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2% and 10% of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low-resource condition of Nepali–English and Sinhala–English. Eleven participants from companies, national research labs, and universities took part.

[1]  Amittai Axelrod,et al.  Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data , 2019, WMT.

[2]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[3]  Josep Maria Crego,et al.  SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering , 2018, WMT.

[4]  Víctor M. Sánchez-Cartagena,et al.  Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task , 2018, WMT.

[5]  Yonatan Belinkov,et al.  Synthetic and Natural Noise Both Break Neural Machine Translation , 2017, ICLR.

[6]  Wolfgang Täger.  The Sentence-Aligned European Patent Corpus , 2011, EAMT.

[7]  Myle Ott,et al.  fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.

[8]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[9]  Germán Sanchis-Trilles,et al.  Filtering of Noisy Parallel Corpora Based on Hypothesis Generation , 2019, WMT.

[10]  Huda Khayrallah,et al.  On the Impact of Various Types of Noise on Neural Machine Translation , 2018, NMT@ACL.

[11]  Jesús González-Rubio.  Webinterpret Submission to the WMT2019 Shared Task on Parallel Corpus Filtering , 2019, WMT.

[12]  Jianfeng Gao,et al.  Domain Adaptation via Pseudo In-Domain Data Selection , 2011, EMNLP.

[13]  Alexander M. Fraser,et al.  An Unsupervised System for Parallel Corpus Filtering , 2018, WMT.

[14]  Jörg Tiedemann,et al.  The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task , 2019, WMT.

[15]  Gabriel Bernier-Colborne,et al.  NRC Parallel Corpus Filtering System for WMT 2019 , 2019, WMT.

[16]  Will Williams,et al.  The Speechmatics Parallel Corpus Filtering System for WMT18 , 2018, WMT.

[17]  Huda Khayrallah,et al.  Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering , 2018, WMT.

[18]  Gustavo Paetzold.  UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation , 2018, WMT.

[19]  Matt Post,et al.  A Call for Clarity in Reporting BLEU Scores , 2018, WMT.

[20]  Keith Stevens,et al.  Effective Parallel Corpus Mining using Bilingual Sentence Embeddings , 2018, WMT.

[21]  Philipp Koehn,et al.  Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings , 2019, WMT.

[22]  Philipp Koehn,et al.  Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora , 2017, EMNLP.

[23]  Philip Resnik,et al.  Mining the Web for Bilingual Text , 1999, ACL.

[24]  Ming Zhou,et al.  Bilingual Data Cleaning for SMT using Graph-based Random Walk , 2013, ACL.

[25]  Christof Monz,et al.  Dynamic Data Selection for Neural Machine Translation , 2017, EMNLP.

[26]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[27]  Philipp Koehn,et al.  Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English , 2019, ArXiv.

[28]  Houda Bouamor,et al.  H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings , 2018, BUCC@LREC.

[29]  Taro Watanabe,et al.  Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection , 2018, WMT.

[30]  Yann Dauphin,et al.  A Convolutional Encoder Model for Neural Machine Translation , 2016, ACL.

[31]  András Kornai,et al.  Parallel corpora for medium density languages , 2007.

[32]  Philipp Koehn,et al.  Six Challenges for Neural Machine Translation , 2017, NMT@ACL.

[33]  Philipp Koehn,et al.  Findings of the WMT 2016 Bilingual Document Alignment Shared Task , 2016, WMT.

[34]  Juri Ganitkevitch,et al.  Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation , 2011, EMNLP.

[35]  Jeremy Gwinnup,et al.  Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task , 2019, WMT.

[36]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[37]  Pushpak Bhattacharyya,et al.  The IIT Bombay English-Hindi Parallel Corpus , 2017, LREC.

[38]  Matt Post,et al.  The Language Demographics of Amazon Mechanical Turk , 2014, TACL.

[39]  Marcin Junczys-Dowmunt,et al.  Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora , 2018, WMT.

[40]  Hermann Ney,et al.  The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task , 2018, WMT.

[41]  Marcis Pinnis,et al.  Tilde’s Parallel Corpus Filtering Methods for WMT 2018 , 2018, WMT.

[42]  Holger Schwenk,et al.  Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings , 2018, ACL.

[43]  Jörg Tiedemann,et al.  Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.

[44]  Marine Carpuat,et al.  Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation , 2017, NMT@ACL.

[45]  George F. Foster,et al.  Reinforcement Learning based Curriculum Optimization for Neural Machine Translation , 2019, NAACL.

[46]  Pushpak Bhattacharyya,et al.  Parallel Corpus Filtering Based on Fuzzy String Matching , 2019, WMT.

[47]  Marcin Junczys-Dowmunt,et al.  The United Nations Parallel Corpus v1.0 , 2016, LREC.

[48]  Robert Dale,et al.  United Nations General Assembly Resolutions: a six-language parallel corpus , 2009.

[49]  Michel Simard,et al.  Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task , 2018, WMT.

[50]  Michel Simard,et al.  Alibaba Submission to the WMT18 Parallel Corpus Filtering Task , 2018, WMT.

[51]  Yandex,et al.  Building a Web-based parallel corpus and filtering out machine-translated text , 2011.

[52]  Chris Quirk,et al.  MT Detection in Web-Scraped Parallel Corpora , 2011, MTSUMMIT.

[53]  Robert Östling,et al.  Noisy Parallel Corpus Filtering through Projected Word Embeddings , 2019, WMT.

[54]  Shahram Khadivi,et al.  Parallel Corpus Refinement as an Outlier Detection Algorithm , 2011, MTSUMMIT.