Paper information - Impact of Corpora Quality on Neural Machine Translation

Impact of Corpora Quality on Neural Machine Translation

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
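The abstract does not list the individual filters, so the following is only a minimal sketch of what removing problematic sentences from a parallel corpus could look like. The function name `clean_parallel`, the `max_ratio` threshold and the four checks (empty side, identical source and target, exact duplicate, extreme length ratio) are illustrative assumptions, not the paper's actual scripts.

```python
def clean_parallel(pairs, max_ratio=3.0):
    """Keep only (source, target) pairs that pass a few simple sanity checks.

    Hypothetical helper: the checks and the default threshold are assumptions
    chosen for illustration, not the filters described in the paper.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # source copied into target, probably untranslated
        if (src, tgt) in seen:
            continue  # exact duplicate of a pair already kept
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # token counts differ too much, probably misaligned
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


pairs = [
    ("Hello world", "Sveika pasaule"),
    ("Hello world", "Sveika pasaule"),   # duplicate
    ("", "x"),                           # empty source
    ("same", "same"),                    # identical sides
    ("a b c d e f g h", "x"),            # extreme length ratio
    ("Good morning", "Labrīt"),
]
cleaned = clean_parallel(pairs)
```

In this sketch the first pair that passes all checks is kept and later exact copies are dropped. A real pipeline would likely add language identification and character-level checks on top of these.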

Matīss Rikters
