论文信息 - Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing lead to improved performance and overall LMs robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA transfer-learning evaluation question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.

Maria Khvalchik | Mikhail Galkin

[1] Roberto Basili,et al. Neural Learning for Question Answering in Italian , 2018, AI*IA.

[2] Mikel Artetxe,et al. On the Cross-lingual Transferability of Monolingual Representations , 2019, ACL.

[3] Martin d'Hoffschmidt,et al. FQuAD: French Question Answering Dataset , 2020, FINDINGS.

[4] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[5] Marco Valtorta,et al. The Effects of Data Quality on Machine Learning Algorithms , 2006, ICIQ.

[6] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[7] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[8] Leonid Boytsov,et al. SberQuAD - Russian Reading Comprehension Dataset: Description and Analysis , 2019, CLEF.

[9] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[10] R'emi Louf,et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[11] Mark Dredze,et al. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT , 2019, EMNLP.

[12] Sebastian Riedel,et al. MLQA: Evaluating Cross-lingual Extractive Question Answering , 2019, ACL.

[13] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14] Beatrice Santorini,et al. Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[15] Guillaume Lample,et al. Cross-lingual Language Model Pretraining , 2019, NeurIPS.

[16] José A. R. Fonollosa,et al. Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering , 2019, LREC.