Integrating Meaning into Quality Evaluation of Machine Translation

Machine translation (MT) quality is typically evaluated by comparing MT outputs against human translations (HT). Traditionally, this evaluation relies on form-related features (e.g., lexicon and syntax) and ignores the transfer of meaning reflected in the HT. Instead, we evaluate the quality of MT outputs through meaning-related features (e.g., polarity, subjectivity) in two experiments. In the first experiment, each meaning-related feature is compared to human rankings individually. In the second experiment, combinations of meaning-related features and other quality metrics are used to predict the same human rankings. The results of both experiments confirm that these features, in addition to traditional metrics that focus mainly on form, help predict human judgments of translation quality.

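To make the two experiments concrete, here is a minimal, hypothetical sketch in Python. It assumes TextBlob for polarity/subjectivity, sacreBLEU for a form-based baseline metric, SciPy for rank correlation, and scikit-learn's random forest to combine features; the toolkits and toy data are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of both experiments; toolkit choices (TextBlob,
# sacreBLEU, SciPy, scikit-learn) are assumptions, not the paper's pipeline.
import numpy as np
import sacrebleu
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor
from textblob import TextBlob

def meaning_features(mt: str, ref: str) -> list:
    """Absolute polarity and subjectivity gaps between MT output and HT."""
    m, r = TextBlob(mt).sentiment, TextBlob(ref).sentiment
    return [abs(m.polarity - r.polarity), abs(m.subjectivity - r.subjectivity)]

# Toy data: outputs of three MT systems for one source sentence,
# the human reference translation, and a human ranking (1 = best).
reference = "The service was surprisingly good."
outputs = {
    "sysA": "The service was surprisingly good.",
    "sysB": "The service was good, surprisingly.",
    "sysC": "The service was not good.",
}
human_rank = {"sysA": 1, "sysB": 2, "sysC": 3}
names = sorted(outputs)

# Experiment 1: compare a single meaning feature to human rankings.
polarity_gap = [meaning_features(outputs[n], reference)[0] for n in names]
tau, _ = kendalltau(polarity_gap, [human_rank[n] for n in names])
print(f"polarity gap vs. human rank: tau = {tau:.2f}")

# Experiment 2: combine meaning features with a form-based metric
# (sentence-level BLEU) and learn to predict the human ranking.
X = np.array([
    meaning_features(outputs[n], reference)
    + [sacrebleu.sentence_bleu(outputs[n], [reference]).score]
    for n in names
])
y = np.array([human_rank[n] for n in names])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("predicted ranks:", model.predict(X).round(2))
```

In a real evaluation the model would be trained and tested on disjoint sets of human-ranked segments (e.g., WMT ranking data) rather than fit and scored on the same three toy examples.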