Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language using only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation, and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on seven languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.
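The abstract names three monolingual training objectives but gives no implementation details. The snippet below is a minimal sketch, assuming a generic PyTorch encoder-decoder with hypothetical `model.loss`, `model.encode`, `noise_fn`, and `back_translate_fn` interfaces and equal loss weighting; it only illustrates how denoising autoencoding, back-translation, and an adversarial language discriminator could be combined in one training step, and is not the paper's actual implementation.

```python
# Hypothetical sketch: combining the three monolingual objectives named in the
# abstract. All interfaces (model.loss, model.encode, noise_fn, back_translate_fn)
# and the equal loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, a common way to realize an adversarial objective."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def adaptation_step(model, discriminator, lrl_batch, hrl_batch, noise_fn, back_translate_fn):
    """One hypothetical adaptation step using only monolingual batches.

    model:          encoder-decoder NMT model (assumed .encode() / .loss() interface)
    discriminator:  classifier predicting the language of pooled encoder states
    lrl_batch:      monolingual batch in the low-resource language
    hrl_batch:      monolingual batch in the related high-resource language
    """
    # 1) Denoising autoencoding: reconstruct the clean sentence from a noised copy.
    dae_loss = model.loss(src=noise_fn(lrl_batch), tgt=lrl_batch)

    # 2) Back-translation: translate LRL -> HRL with the current model (no gradients),
    #    then train on the synthetic pair in the reverse direction.
    with torch.no_grad():
        synthetic_hrl = back_translate_fn(model, lrl_batch)
    bt_loss = model.loss(src=synthetic_hrl, tgt=lrl_batch)

    # 3) Adversarial objective: push LRL and HRL encoder representations toward a
    #    shared space by fooling a language discriminator via gradient reversal.
    lrl_states = GradReverse.apply(model.encode(lrl_batch))
    hrl_states = GradReverse.apply(model.encode(hrl_batch))
    logits = discriminator(torch.cat([lrl_states, hrl_states], dim=0))
    labels = torch.cat([torch.zeros(len(lrl_batch), dtype=torch.long),
                        torch.ones(len(hrl_batch), dtype=torch.long)])
    adv_loss = nn.functional.cross_entropy(logits, labels)

    # Equal weighting assumed for illustration only.
    return dae_loss + bt_loss + adv_loss
```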
