bleu2vec: the Painfully Familiar Metric on Continuous Vector Space Steroids

In this participation in the WMT’2017 metrics shared task we implement a fuzzy match score for n-gram precisions in the BLEU metric. To do this we learn n-gram embeddings; we describe two ways of extending the WORD2VEC approach to do so. Evaluation results show that the introduced score outperforms the original BLEU metric at both the system and segment level.

1 The Painfully Familiar Metric

The BLEU metric (Papineni et al., 2002) is deeply rooted in the machine translation community and is used in virtually every paper on machine translation methods. Despite the well-known criticism (Callison-Burch et al., 2006) and a decade of collective efforts to come up with a better translation quality metric (from Callison-Burch et al., 2007 to Bojar et al., 2016), it still appeals with its ease of implementation, language independence and competitive agreement rate with human judgments; the only viable alternative on all three accounts is the recently introduced CHRF (Popovic, 2015).

The original version of BLEU is harsh on single sentences: one of the factors of the score is a geometric mean of n-gram precisions between the translation hypothesis and the reference(s), and as a result sentences without 4-gram matches get a score of 0, even if there are good unigram, bigram and possibly trigram matches. There have been several attempts to “soften” this approach: using the arithmetic mean instead (NIST; Doddington, 2002), allowing for partial matches using lemmatization and synonyms (METEOR; Banerjee and Lavie, 2005), and directly implementing fuzzy matches between n-grams (LEBLEU; Virpioja and Grönroos, 2015). Our work is most closely related to LEBLEU, where BLEU is augmented with fuzzy matches based on the character-level Levenshtein distance. Here we use independently learned word and n-gram embeddings instead.
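To make the harshness concrete, here is a minimal sentence-level sketch of BLEU's clipped n-gram precisions and their geometric mean (brevity penalty omitted; this is an illustrative sketch, not the official implementation): any single zero precision, e.g. a missing 4-gram match, collapses the whole score to 0.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the overlapping n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngram_counts(hyp, n)
    ref_counts = ngram_counts(ref, n)
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of the 1..max_n precisions (brevity penalty omitted).
    A zero anywhere, e.g. no 4-gram match, forces the whole score to 0."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    score = 1.0
    for p in precisions:
        score *= p
    return score ** (1.0 / max_n)

hyp = "a cat sat on the mat".split()
ref = "the cat was sitting on the mat".split()
# Unigram, bigram and trigram precisions are all positive here, but the
# two sentences share no 4-gram, so the geometric mean collapses.
print(sentence_bleu(hyp, ref))  # → 0.0
```

This is exactly the behavior that smoothed or fuzzy variants such as the one proposed here try to avoid.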
2 The Continuous Vector Space Steroids

Together with neural networks came the necessity to map sparse discrete values (like natural language words) into dense continuous vector representations. This is done explicitly, e.g. with WORD2VEC (Mikolov et al., 2013), as well as learned as part of the whole training process in neural network-based language models (Mikolov et al., 2010) and translation approaches (Bahdanau et al., 2015). The approach of learning embeddings has since been extended, for example, to items in a relational database (Barkan and Koenigstein, 2016), sentences and documents (Le and Mikolov, 2014) and even users (Amir et al., 2017).

The core part of this work consists of n-gram embeddings, the aim of which is to find similarities between short phrases like “research paper” and “scientific article”, or “do not like” and “hate”. We propose two solutions, both reducing the problem to the original WORD2VEC; the first one only handles n-grams of the same length while the second one is more general. These are described in the following sections.

2.1 Separate N-gram Embeddings

Our first approach is learning separate embedding models for unigrams, bigrams and trigrams. While
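The reduction to the original WORD2VEC can be sketched as a corpus preprocessing step: join each overlapping n-gram into a single token, then train a standard, unmodified word2vec model once per n-gram order. A minimal sketch follows; the `_` joining convention and the gensim call in the comment are illustrative assumptions, not necessarily the authors' exact pipeline.

```python
from typing import List

def to_ngram_tokens(sentence: List[str], n: int) -> List[str]:
    """Re-tokenize a sentence into overlapping n-gram 'words'.
    Joining with '_' turns each n-gram into a single opaque token, so an
    unmodified word2vec trainer can learn n-gram embeddings directly
    from the converted corpus."""
    return ["_".join(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]

corpus = [
    "this is a research paper".split(),
    "this is a scientific article".split(),
]

# One converted corpus, and hence one separate embedding model,
# per n-gram order (unigrams need no conversion).
bigram_corpus = [to_ngram_tokens(s, 2) for s in corpus]
trigram_corpus = [to_ngram_tokens(s, 3) for s in corpus]

print(bigram_corpus[0])
# → ['this_is', 'is_a', 'a_research', 'research_paper']

# Training would then run a standard word2vec pipeline on each converted
# corpus, e.g. (hypothetical call, assuming gensim is available):
#   model = Word2Vec(bigram_corpus, vector_size=100, window=5, min_count=1)
```

Because each model is trained on tokens of a single fixed length, this first approach can only compare n-grams of the same order, which motivates the more general second solution mentioned above.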

[1] Philipp Koehn, et al. Re-evaluating the Role of Bleu in Machine Translation Research, 2006, EACL.

[2] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[3] Kevin Gimpel, et al. Charagram: Embedding Words and Sentences via Character n-grams, 2016, EMNLP.

[4] Philipp Koehn, et al. (Meta-) Evaluation of Machine Translation, 2007, WMT@ACL.

[5] Quoc V. Le, et al. Distributed Representations of Sentences and Documents, 2014, ICML.

[6] Mark Dredze, et al. Learning Composition Models for Phrase Embeddings, 2015, TACL.

[7] Oren Barkan, et al. ITEM2VEC: Neural item embedding for collaborative filtering, 2016, IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP).

[8] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.

[9] Ondrej Bojar, et al. Results of the WMT16 Metrics Shared Task, 2016.

[10] Jean Carletta, et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, ACL.

[11] George R. Doddington, et al. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002.

[12] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[13] Lukás Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.

[14] Maja Popovic, et al. chrF: character n-gram F-score for automatic MT evaluation, 2015, WMT@EMNLP.

[15] Sami Virpioja, et al. LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages, 2015, WMT@EMNLP.

[16] Byron C. Wallace, et al. Quantifying Mental Health from Social Media with Neural User Embeddings, 2017, MLHC.

[17] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.