Noisy Parallel Corpus Filtering through Projected Word Embeddings

We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spit ...

Robert Östling | Murathan Kurfali

[1] Guillaume Lample, et al. Word Translation Without Parallel Data, 2017, ICLR.

[2] Jörg Tiedemann, et al. Efficient Word Alignment with Markov Chain Monte Carlo, 2016, Prague Bull. Math. Linguistics.

[3] Philipp Koehn, et al. Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English, 2019, ArXiv.

[4] Houda Bouamor, et al. H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings, 2018, BUCC@LREC.

[5] Raivis Skadins, et al. Word Alignment Based Parallel Corpora Evaluation and Cleaning Using Machine Learning Techniques, 2015, EAMT.

[6] Huda Khayrallah, et al. Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering, 2018, WMT.

[7] Marcis Pinnis, et al. Tilde’s Parallel Corpus Filtering Methods for WMT 2018, 2018, WMT.

[8] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information, 2016, TACL.

[9] Alexandros Nanopoulos, et al. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data, 2010, J. Mach. Learn. Res.

[10] Holger Schwenk, et al. Filtering and Mining Parallel Data in a Joint Multilingual Space, 2018, ACL.

[11] Philipp Koehn, et al. Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora, 2017, EMNLP.