Enhancing Access to Online Education: Quality Machine Translation of MOOC Content

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, videolectures, assignments, and MOOC forum text is automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students' forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal

[1]  Vikas Sindhwani,et al.  Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria , 2009, HLT-NAACL 2009.

[2]  Chris Callison-Burch,et al.  Crowdsourcing Translation: Professional Quality from Non-Professionals , 2011, ACL.

[3]  Doug Downey,et al.  Local and Global Algorithms for Disambiguation to Wikipedia , 2011, ACL.

[4]  Chris Callison-Burch,et al.  Machine Translation of Arabic Dialects , 2012, NAACL.

[5]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[6]  Alexandra Birch,et al.  Mixed domain vs. multi-domain statistical machine translation , 2015, MTSUMMIT.

[7]  Chris Callison-Burch,et al.  Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk , 2009, EMNLP.

[8]  Jaime G. Carbonell,et al.  Active Learning and Crowd-Sourcing for Machine Translation , 2010, LREC.

[9]  Alexandru Ceausu,et al.  South-East European Times : A parallel corpus of Balkan languages , Francis Tyers and , 2010 .

[10]  Jörg Tiedemann,et al.  Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.

[11]  Christian Wartena,et al.  Keyword Extraction Using Word Co-occurrence , 2010, 2010 Workshops on Database and Expert Systems Applications.

[12]  Christopher D. Manning,et al.  A Simple and Effective Hierarchical Phrase Reordering Model , 2008, EMNLP.

[13]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[14]  Rada Mihalcea,et al.  Wikify!: linking documents to encyclopedic knowledge , 2007, CIKM '07.

[15]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[16]  Lynette Hirschman,et al.  Name Translation as a Machine Translation Evaluation Task , 2006 .

[17]  Tomaz Erjavec,et al.  The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages , 2006, LREC.

[18]  Mauro Cettolo,et al.  WIT3: Web Inventory of Transcribed and Translated Talks , 2012, EAMT.

[19]  George F. Foster,et al.  Batch Tuning Strategies for Statistical Machine Translation , 2012, NAACL.

[20]  Mark Dredze,et al.  Annotating Named Entities in Twitter Data with Crowdsourcing , 2010, Mturk@HLT-NAACL.

[21]  Ahmed Abdelali,et al.  The AMARA Corpus: Building Parallel Language Resources for the Educational Domain , 2014, LREC.

[22]  Marti A. Hearst TileBars: visualization of term distribution information in full text information access , 1995, CHI '95.

[23]  Nadir Durrani,et al.  Model With Minimal Translation Units, But Decode With Phrases , 2013, HLT-NAACL.