A Structured Review of the Validity of BLEU
[1] George R. Doddington. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002, HLT.
[2] Kishore Papineni, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[3] Satanjeev Banerjee, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[4] Chris Callison-Burch, et al. Re-evaluating the Role of Bleu in Machine Translation Research, 2006, EACL.
[5] Anja Belz, et al. Comparing Automatic and Human Evaluation of NLG Systems, 2006, EACL.
[6] Anja Belz, et al. Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation, 2008, ACL.
[7] Ehud Reiter, et al. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems, 2009, CL.
[8] D. Moher, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement, 2009, BMJ.
[9] Dominic Espinosa, et al. Further Meta-Evaluation of Broad-Coverage Surface Realization, 2010, EMNLP.
[10] Houda Bouamor, et al. A Human Judgement Corpus and a Metric for Arabic MT Evaluation, 2014, EMNLP.
[11] Yvette Graham. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE, 2015, EMNLP.
[12] Matous Machacek, et al. Results of the WMT13 Metrics Shared Task, 2013, WMT@ACL.
[13] V. Prasad, et al. The Strength of Association Between Surrogate End Points and Survival in Oncology: A Systematic Review of Trial-Level Meta-analyses, 2015, JAMA Internal Medicine.
[14] Ondrej Bojar, et al. Findings of the 2016 Conference on Machine Translation, 2016, WMT.
[15] Jekaterina Novikova, et al. Why We Need New Evaluation Metrics for NLG, 2017, EMNLP.
[16] Mert Kilickaya, et al. Re-evaluating Automatic Metrics for Image Captioning, 2017, EACL.