Constant evaluation is vital to the progress of machine translation. However, human evaluation is costly, time-consuming, and difficult to carry out reliably. On the other hand, automatic measures of machine translation performance (such as BLEU, NIST, TER, and METEOR), while cheap and objective, have increasingly come under suspicion as to whether they are satisfactory measuring instruments. Recent work (e.g., Callison-Burch et al. (2006)) has demonstrated that for current state-of-the-art MT systems, the correlation between BLEU scores and human adequacy and fluency ratings is often low; BLEU scores tend to favor statistical over rule-based systems; and BLEU-like measures tend to perform worse at the segment level than at the corpus level.

The core of the problem is that BLEU (Papineni et al., 2001), and to a first approximation other automatic measures, work at the surface level, looking for shared word sequences between a system translation and one or more reference translations. This evaluation ignores many known facts about linguistic semantics, whereby the same meaning can be conveyed in many different ways, whether by syntactic rearrangement or by exploiting lexical semantics (synonyms, etc.) and larger semantic paraphrases. Consider the two (real-world) example sentences in Figure 1, which are largely equivalent but differ substantially on the surface. The equivalence of the two sentences hinges not only on the synonymy of individual words (practice/policy), but also on phrasal replacements (promote/make statements in favor of) and lexical-semantic properties of words (such as the “built-in” negation of barring).

In this paper, we present a study whose goal is to improve the prediction of adequacy judgments for MT system translations by accounting for such semantic phenomena. To do so, we model MT evaluation as an instance of the “recognition of textual entailment” (RTE) task (Dagan et al., 2005). RTE was introduced as a “practical” inference procedure that determines whether an inferential relation holds between two short segments of text, the premise and the hypothesis: Is the hypothesis entailed by the premise, or not? Textual entailment has been found to be beneficial for a range of applications, for example in answer validation in Question Answering or in word sense disambiguation (Dagan et al., 2006; Harabagiu and Hickl, 2006). Our intuition is that the evaluation of MT output for adequacy can also be seen as an entailment task: a candidate translation (i.e., MT system output) is adequate to the extent that it and the reference translation entail each other.
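To make the surface-matching limitation concrete, the sketch below computes the clipped n-gram precision that lies at the heart of BLEU-style metrics for a toy paraphrase pair. The sentence pair and function names are illustrative assumptions, and BLEU's multi-reference clipping, brevity penalty, and geometric averaging over n = 1..4 are omitted; this is not the metric implementation evaluated in this paper.

from collections import Counter


def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also occur in the reference,
    # with counts clipped to the reference counts (as in BLEU).
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())


# Hypothetical paraphrase pair (not the actual Figure 1 sentences).
candidate = "the government promotes a policy of educating children at home".split()
reference = "the state makes statements in favor of home schooling for children".split()

for n in (1, 2):
    print(f"{n}-gram precision: {clipped_ngram_precision(candidate, reference, n):.2f}")
# Despite rough semantic equivalence, surface n-gram overlap is low,
# which is exactly the failure mode that motivates an entailment-based metric.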
[1] Christopher D. Manning, et al. Learning to recognize features of valid textual entailments. NAACL, 2006.
[2] Ido Dagan, et al. The Third PASCAL Recognizing Textual Entailment Challenge. ACL-PASCAL@ACL, 2007.
[3] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.
[4] George R. Doddington, et al. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. 2002.
[5] Sanda M. Harabagiu, et al. Methods for Using Textual Entailment in Open-Domain Question Answering. ACL, 2006.
[6] Christopher D. Manning, et al. Aligning Semantic Graphs for Textual Inference and Machine Reading. 2007.
[7] Chin-Yew Lin, et al. ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation. COLING, 2004.
[8] Ralph Weischedel, et al. A Study of Translation Error Rate with Targeted Human Annotation. 2005.
[9] Philipp Koehn, et al. Re-evaluating the Role of Bleu in Machine Translation Research. EACL, 2006.
[10] Carlo Strapparava, et al. Direct Word Sense Matching for Lexical Substitution. ACL, 2006.
[11] Matthew G. Snover, et al. A Study of Translation Edit Rate with Targeted Human Annotation. AMTA, 2006.