Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data

We introduce a purely monolingual approach to filtering a noisy corpus for parallel data in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to handle cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, adapted from Cynical data selection (Axelrod, 2017), that is competitive (within 1.8 BLEU) with the best bilingual filtering method when used to train SMT systems. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
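To make the idea concrete, the following is a minimal sketch of dual monolingual scoring. It is a hypothetical simplification: the actual criterion adapts the incremental cross-entropy *delta* of Cynical selection, whereas this toy version scores each candidate pair with two independent monolingual language models (here, tiny add-one-smoothed unigram models standing in for real LMs such as KenLM) and combines them in the spirit of Junczys-Dowmunt's dual conditional criterion, i.e. average entropy plus a penalty for disagreement between the two sides. The `UnigramLM` class, the `dual_mono_score` function, and the 0.5 weighting are all illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter

class UnigramLM:
    """Tiny add-one-smoothed unigram LM; a stand-in for a real LM (e.g. KenLM)."""
    def __init__(self, corpus):
        tokens = [tok for line in corpus for tok in line.split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def cross_entropy(self, sentence):
        """Per-word cross-entropy (bits) of `sentence` under this LM."""
        toks = sentence.split()
        if not toks:
            return float("inf")
        logp = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab))
            for t in toks
        )
        return -logp / len(toks)

def dual_mono_score(src_lm, tgt_lm, src_sent, tgt_sent):
    """Lower is better. Hypothetical combination: average of the two
    monolingual entropies plus a penalty when the two sides disagree,
    mirroring the shape of dual conditional cross-entropy filtering."""
    h_src = src_lm.cross_entropy(src_sent)
    h_tgt = tgt_lm.cross_entropy(tgt_sent)
    return abs(h_src - h_tgt) + 0.5 * (h_src + h_tgt)

# Illustrative usage on toy corpora: score candidate pairs, keep the lowest.
en_lm = UnigramLM(["the cat sat", "the dog ran"])
xx_lm = UnigramLM(["foo bar baz", "foo qux"])
print(dual_mono_score(en_lm, xx_lm, "the cat", "foo bar"))
```

In practice one would rank all candidate pairs in the noisy corpus by this score and keep the top portion; the purely monolingual design is what lets the method run when no seed parallel data exists.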

[1] Jeremy Gwinnup et al. Coverage and Cynicism: The AFRL Submission to the WMT 2018 Parallel Corpus Filtering Task, 2018, WMT.

[2] Jianfeng Gao et al. Domain Adaptation via Pseudo In-Domain Data Selection, 2011, EMNLP.

[3] Taku Kudo et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[4] Amittai Axelrod. Cynical Selection of Language Model Training Data, 2017, ArXiv.

[5] Holger Schwenk et al. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, 2018, ACL.

[6] Kenneth Heafield et al. KenLM: Faster and Smaller Language Model Queries, 2011, WMT@EMNLP.

[7] Philipp Koehn et al. Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English, 2019, ArXiv.

[8] William D. Lewis et al. Intelligent Selection of Language Model Training Data, 2010, ACL.

[9] Huda Khayrallah et al. Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering, 2018, WMT.

[10] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[11] Philipp Koehn et al. Findings of the WMT 2016 Bilingual Document Alignment Shared Task, 2016, WMT.

[12] Thierry Etchegoyhen et al. STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering, 2018, WMT.

[13] Amittai Axelrod et al. Data Selection with Cluster-Based Language Difference Models and Cynical Selection, 2019, ArXiv.

[14] Philipp Koehn et al. Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions, 2019, WMT.

[15] Panayiotis G. Georgiou et al. Text data acquisition for domain-specific language models, 2006, EMNLP.

[16] Michel Simard et al. Alibaba Submission to the WMT18 Parallel Corpus Filtering Task, 2018, WMT.

[17] Kevin Duh et al. Curriculum Learning for Domain Adaptation in Neural Machine Translation, 2019, NAACL.

[18] Marcin Junczys-Dowmunt et al. Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora, 2018, WMT.