Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature and train logistic regression models to distinguish good data from synthetic noisy data in the proposed feature space. The trained model is then used to score parallel sentences in the data pool for selection. Experiments show that Zipporah selects a high-quality parallel corpus from a large, mixed-quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1-point BLEU improvement while using only 1/5 of the data, compared with using the entire corpus.
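The pipeline described above (featurize sentence pairs, train a logistic regression classifier on good versus synthetic noisy pairs, then score and select from the pool) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from Zipporah itself: the toy dictionary `TOY_DICT`, the two features (dictionary coverage and length ratio), the way noisy pairs are synthesized by mismatching targets, and the plain gradient-descent trainer.

```python
import math

# Hypothetical toy French-English dictionary (an assumption for this sketch).
TOY_DICT = {
    "le": {"the"}, "un": {"a"}, "chat": {"cat"}, "chien": {"dog"},
    "noir": {"black"}, "blanc": {"white"}, "mange": {"eats"}, "dort": {"sleeps"},
}

def features(src, tgt):
    """Toy bag-of-words features: dictionary coverage and length ratio."""
    s, t = src.split(), tgt.split()
    t_set = set(t)
    covered = sum(1 for w in s if TOY_DICT.get(w, set()) & t_set)
    coverage = covered / max(len(s), 1)
    ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    return [coverage, ratio]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - label
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def score(pair, w, b):
    """Probability that a sentence pair is a good translation pair."""
    x = features(*pair)
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def select(pool, w, b, keep_fraction):
    """Keep the highest-scoring fraction of the pool."""
    ranked = sorted(pool, key=lambda p: score(p, w, b), reverse=True)
    return ranked[:max(1, int(len(pool) * keep_fraction))]

good = [
    ("le chat noir dort", "the black cat sleeps"),
    ("un chien blanc mange", "a white dog eats"),
    ("le chien dort", "the dog sleeps"),
    ("un chat mange", "a cat eats"),
]
# Synthetic noise (assumed scheme): pair each source with the next pair's target.
noisy = [(good[i][0], good[(i + 1) % len(good)][1]) for i in range(len(good))]

X = [features(*p) for p in good + noisy]
y = [1] * len(good) + [0] * len(noisy)
w, b = train(X, y)

pool = [
    ("le chat dort", "the cat sleeps"),
    ("un chien noir mange", "click here to subscribe"),
    ("le chien blanc dort", "the white dog sleeps"),
    ("un chat noir", "free shipping today"),
]
selected = select(pool, w, b, keep_fraction=0.5)
```

In this sketch `keep_fraction` plays the role of the selection budget (for example, 1/5 of the pool); a real system would choose it by measuring downstream translation quality.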
