Training, Enhancing, Evaluating and Using MT Systems with Comparable Data

This chapter describes how parallel and semi-parallel data extracted from comparable corpora can be used to enhance machine translation (MT) systems. It covers the methods used for this task in statistical and rule-based MT systems, and the showcases that illustrate how such enhanced systems are used. The impact of data extracted from comparable corpora on MT quality is evaluated for 17 language pairs, and detailed studies involving human evaluation are carried out for 11 language pairs. First, baseline statistical machine translation (SMT) systems were built using traditional SMT techniques. They were then improved by integrating additional data extracted from comparable corpora, and a comparative evaluation was performed to measure the improvements. Comparable corpora were also used to enrich the linguistic knowledge of rule-based machine translation (RBMT) systems by applying terminology extraction technology. Finally, SMT systems were adapted to a narrow domain by including domain-specific knowledge such as terminology, named entities (NEs), and domain-specific language models (LMs).
