Building English-to-Serbian Machine Translation System for IMDb Movie Reviews

This paper reports the results of the first experiment addressing the challenges of building a machine translation system for user-generated content involving a complex South Slavic language. We focus on translating English IMDb user movie reviews into Serbian in a low-resource scenario. We explore the potential and limits of (i) phrase-based and neural machine translation systems trained on clean out-of-domain parallel data from news articles, and (ii) creating an additional synthetic in-domain parallel corpus by machine-translating the English IMDb corpus into Serbian. Our main findings are that morphology and syntax are handled better by the neural approach than by the phrase-based approach, even in this low-resource, domain-mismatched scenario; the situation is different, however, for the lexical aspect, especially for person names. This finding also indicates that machine translation of person names into Slavic languages (especially those that require or allow transcription) should be investigated more systematically.
