deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

We introduce Discriminative BLEU (∆BLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [−1, +1], and these ratings are used to weight multi-reference BLEU. In tasks involving generation of conversational responses, ∆BLEU correlates reasonably with human judgments and outperforms both sentence-level BLEU and IBM BLEU in terms of Spearman's ρ and Kendall's τ.
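The weighting scheme lends itself to a compact implementation. The Python sketch below illustrates the core idea: a corpus-level, BLEU-style score in which each matched n-gram is credited with the highest human rating among the references containing it, so that matching a negatively rated reference hurts the score. It is a minimal sketch under assumed input conventions (pre-tokenized hypotheses; references given as (tokens, weight) pairs); the exact clipping, normalization, and smoothing in the paper's formulation may differ, so treat it as illustrative rather than a reference implementation of ∆BLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def delta_bleu(hypotheses, references, max_n=4):
    """Simplified, illustrative Delta-BLEU.

    hypotheses: list of token lists, one per source context.
    references: list of lists of (token_list, weight) pairs, where each
        weight is a human quality rating in [-1, +1].
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        num, den = 0.0, 0.0
        for hyp, refs in zip(hypotheses, references):
            hyp_counts = ngrams(hyp, n)
            max_w = max(w for _, w in refs)  # best rating for this context
            for g, count in hyp_counts.items():
                # Credit each matched n-gram with the best weighted,
                # count-clipped match over the references containing it.
                matches = [
                    (w, min(count, ngrams(ref, n)[g]))
                    for ref, w in refs
                    if g in ngrams(ref, n)
                ]
                if matches:
                    num += max(w * c for w, c in matches)
                den += max_w * count
        # Crude floor in place of proper smoothing, to keep log defined.
        log_precisions.append(math.log(max(num, 1e-12) / max(den, 1e-12)))

    # Brevity penalty against the closest-length reference, as in BLEU.
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(
        min((len(r) for r, _ in refs), key=lambda l: (abs(l - len(h)), l))
        for h, refs in zip(hypotheses, references)
    )
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A toy call shows the intended effect: the hypothesis below shares n-grams with the highly rated reference, not the negatively rated one, so the score stays positive.

```python
hyp = [["i", "love", "this", "movie"]]
refs = [[(["i", "love", "this", "film"], 1.0),
         (["that", "is", "terrible"], -0.8)]]
print(delta_bleu(hyp, refs, max_n=2))
```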
