Design of a hybrid high quality machine translation system

This paper gives an overview of the ongoing FP7 project HyghTra (2010-2014). The project is a partnership between academia and industry, involving the University of Leeds and the company Lingenio GmbH. It adopts a hybrid, bootstrapping approach to improving MT quality: rule-based analysis and statistical evaluation techniques are applied to both parallel and comparable corpora in order to extract linguistic information, which is then used to enrich the lexical and syntactic resources of the underlying rule-based MT system, the same system that is used to analyse the corpora. The project places special emphasis on extending systems to new language pairs and on the corresponding rapid, automated creation of high-quality resources. The techniques are fielded and evaluated within an existing commercial MT environment.
