The Speechmatics Parallel Corpus Filtering System for WMT18

Our entry to the parallel corpus filtering task uses a two-step strategy. The first step applies a series of pragmatic hard ‘rules’ to remove the worst example sentences, reducing the effective corpus size from the initial 1 billion tokens to 160 million. The second step combines four different heuristics into a weighted score, which is then used to filter further, down to 100 million or 10 million tokens. Our final system produces competitive results without requiring excessive fine-tuning to the exact task or language pair. The first step in isolation provides a very fast filter that gives most of the gains of the final system.
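
The two-step strategy can be illustrated with a minimal sketch. The abstract does not list the actual rules, heuristics, or weights, so everything concrete here is an assumption made for illustration: the three hard rules (empty side, identical source and target, length ratio above 3), the names `passes_hard_rules`, `weighted_score`, and `filter_corpus`, and the choice to count the token budget in whitespace-separated target-side tokens.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)
Heuristic = Callable[[Pair], float]


def passes_hard_rules(pair: Pair, max_len_ratio: float = 3.0) -> bool:
    """Step 1: cheap rejection rules. These rules are illustrative
    placeholders, not the rules used in the paper."""
    src, tgt = pair
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # one side is empty
    if src.strip() == tgt.strip():
        return False  # target is an untranslated copy of the source
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_len_ratio  # reject badly mismatched lengths


def weighted_score(pair: Pair,
                   heuristics: Sequence[Heuristic],
                   weights: Sequence[float]) -> float:
    """Step 2: combine per-pair heuristic values into a single score."""
    return sum(w * h(pair) for h, w in zip(heuristics, weights))


def filter_corpus(pairs: Sequence[Pair],
                  heuristics: Sequence[Heuristic],
                  weights: Sequence[float],
                  token_budget: int) -> List[Pair]:
    """Apply the hard rules, rank survivors by weighted score, and keep
    the best pairs until the (assumed target-side) token budget is hit."""
    survivors = [p for p in pairs if passes_hard_rules(p)]
    ranked = sorted(survivors,
                    key=lambda p: weighted_score(p, heuristics, weights),
                    reverse=True)
    kept: List[Pair] = []
    used = 0
    for pair in ranked:
        n_tokens = len(pair[1].split())
        if used + n_tokens > token_budget:
            break  # stop at the first pair that would exceed the budget
        kept.append(pair)
        used += n_tokens
    return kept
```

In this sketch, `heuristics` would hold four scoring functions and `weights` their relative importance. Running `filter_corpus` once with a budget of 100 million and once with 10 million would give the two subsets the abstract describes.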
