Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Lecture translation is a case of spoken language translation, and publicly available parallel corpora for it are scarce. To address this, we examine a language-independent framework for parallel corpus mining that offers a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. Our approach determines sentence alignments by relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in multistage fine-tuning-based domain adaptation for high-quality lecture translation. For Japanese-English lecture translation, we extracted approximately 40,000 lines of parallel data and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances translation quality when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, addressing noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
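
The alignment idea mentioned above (cosine similarity over sentence representations) can be illustrated with a minimal sketch. This is not the authors' released code: the function names `cosine` and `align`, the greedy best-match strategy, and the `threshold` value of 0.8 are all assumptions made for illustration, and the toy vectors stand in for sentence embeddings that would in practice be computed from machine-translated source sentences and target sentences.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(
    src_vecs: List[Vector],
    tgt_vecs: List[Vector],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """For each source vector, keep its most similar target vector
    if the similarity reaches the threshold (greedy, illustrative only)."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy embeddings: the third source sentence has no close target match.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
aligned = align(src, tgt)
for i, j, score in aligned:
    print(f"source {i} <-> target {j} (cosine {score:.2f})")
```

Under these assumptions, pairs whose similarity falls below the cutoff are simply dropped, which is one simple way to limit noise in mined data; the manual filtering described for the development and test sets would be a separate step.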
