The statistical advantage of automatic NLG metrics at the system level