Aspects of Terminological and Named Entity Knowledge within Rule-Based Machine Translation Models for Under-Resourced Neural Machine Translation Scenarios

Rule-based machine translation is a machine translation paradigm in which linguistic knowledge is encoded by an expert in the form of rules that translate text from the source to the target language. While this approach grants extensive control over the output of the system, the cost of formalising the required linguistic knowledge is much higher than that of training a corpus-based system, where a machine learning approach is used to automatically learn to translate from examples. In this paper, we describe different approaches to leveraging the information contained in rule-based machine translation systems to improve a corpus-based one, namely a neural machine translation model, with a focus on a low-resource scenario. Three different kinds of information were used: morphological information, named entities and terminology. In addition to evaluating the general performance of the system, we systematically analysed the performance of the proposed approaches when dealing with the targeted phenomena. Our results suggest that the proposed models have limited ability to learn from external information, and most approaches do not significantly alter the results of the automatic evaluation. However, our preliminary qualitative evaluation shows that in certain cases the hypotheses generated by our system exhibit favourable behaviour, such as preserving the use of the passive voice.
