Tapadóir: Developing a Statistical Machine Translation Engine and Associated Resources for Irish

Tapadóir (from the Irish tapa ‘fast’ and the nominal suffix -óir) is a statistical machine translation (SMT) project funded by the Irish government. The work was commissioned to help government translators meet the translation demands that have arisen from the Irish language’s status as an official EU and national language. The system translates from English into Irish, a morphologically rich, low-resourced minority language, and its development has posed an interesting set of challenges. These challenges have inspired a creative response to the lack of data and NLP tools available for Irish, and have resulted in the development of new resources for the Irish linguistic and NLP community. We show that our SMT system outperforms Google Translate™ (a widely used general-domain SMT system) as a result of the steps we have taken to tailor translation output to the user’s specific needs.
