Jungo Kasai | Yejin Choi | Noah A. Smith | Daniel Khashabi | Gabriel Stanovsky | Daniel S. Weld | Nicholas Lourie | Jonathan Bragg
[1] Elizabeth Clark,et al. Evaluation of Text Generation: A Survey , 2020, ArXiv.
[2] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[3] Lucia Specia,et al. Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale , 2020, COLING.
[4] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[5] Christopher D. Manning,et al. Get To The Point: Summarization with Pointer-Generator Networks , 2017, ACL.
[6] Jonathan Berant,et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge , 2019, NAACL.
[7] Timothy Baldwin,et al. Is Machine Translation Getting Better over Time? , 2014, EACL.
[8] Jianfeng Gao,et al. An Information-Theoretic Approach to Automatic Evaluation of Summaries , 2006, NAACL.
[9] Thibault Sellam,et al. BLEURT: Learning Robust Metrics for Text Generation , 2020, ACL.
[10] Matt Gardner,et al. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics , 2020, EMNLP.
[11] Omer Levy,et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.
[12] Gabriel Stanovsky,et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs , 2019, NAACL.
[13] Eunsol Choi,et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , 2017, ACL.
[14] Deborah A. Coughlin,et al. Correlating automated and human assessments of machine translation quality , 2003, MTSUMMIT.
[15] Philipp Koehn,et al. Findings of the 2020 Conference on Machine Translation (WMT20) , 2020, WMT.
[16] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.
[17] Kilian Q. Weinberger,et al. BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.
[18] Philipp Koehn,et al. Re-evaluating the Role of Bleu in Machine Translation Research , 2006, EACL.
[19] Mausam,et al. To Re(label), or Not To Re(label) , 2014, HCOMP.
[20] Myle Ott,et al. Facebook FAIR’s WMT19 News Translation Task Submission , 2019, WMT.
[21] Philipp Koehn,et al. Translationese in Machine Translation Evaluation , 2019, EMNLP.
[22] Matt Post,et al. A Call for Clarity in Reporting BLEU Scores , 2018, WMT.
[23] Mirella Lapata,et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization , 2018, EMNLP.
[24] Yao Zhao,et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization , 2020, ICML.
[25] Rajendra Bhatia,et al. A Better Bound on the Variance , 2000, Am. Math. Mon..
[26] Benjamin Van Durme,et al. Efficient Online Scalar Annotation with Bounded Support , 2018, ACL.
[27] Dragomir R. Radev,et al. Generating summaries of multiple news articles , 1995, SIGIR '95.
[28] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Yejin Choi,et al. Evaluating Machines by their Real-World Language Use , 2020, ArXiv.
[30] Ido Dagan,et al. Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation , 2019, NAACL.
[31] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[32] Phil Blunsom,et al. Teaching Machines to Read and Comprehend , 2015, NIPS.
[33] Philipp Koehn,et al. Findings of the 2018 Conference on Machine Translation (WMT18) , 2018, WMT.
[34] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019.
[35] Shashi Narayan,et al. HighRES: Highlight-based Reference-less Evaluation of Summarization , 2019, ACL.
[36] Percy Liang,et al. Unifying Human and Statistical Evaluation for Natural Language Generation , 2019, NAACL.
[37] Chris Callison-Burch,et al. ChatEval: A Tool for Chatbot Evaluation , 2019, NAACL.
[38] Richard Socher,et al. SummEval: Re-evaluating Summarization Evaluation , 2020, ArXiv.
[39] Ani Nenkova,et al. Evaluating Content Selection in Summarization: The Pyramid Method , 2004, NAACL.
[40] Philipp Koehn,et al. Johns Hopkins University Submission for WMT News Translation Task , 2019, WMT.
[41] Lora Aroyo,et al. Metrology for AI: From Benchmarks to Instruments , 2019, ArXiv.
[42] Andy Way,et al. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation , 2018, WMT.
[43] Gunhee Kim,et al. Abstractive Summarization of Reddit Posts with Multi-level Memory Networks , 2018, NAACL.
[44] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[45] Doug Downey,et al. Abductive Commonsense Reasoning , 2019, ICLR.
[46] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.
[47] George R. Doddington,et al. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002.
[48] Ido Dagan,et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature , 2019, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation.
[49] Timothy Baldwin,et al. Continuous Measurement Scales in Human Evaluation of Machine Translation , 2013, LAW@ACL.
[50] Michael S. Bernstein,et al. HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models , 2019, NeurIPS.
[51] Myle Ott,et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.
[52] Marjan Ghazvininejad,et al. Multilingual Denoising Pre-training for Neural Machine Translation , 2020, Transactions of the Association for Computational Linguistics.
[53] Ondrej Bojar,et al. Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance , 2018, WMT.
[54] Marta R. Costa-jussà,et al. Findings of the 2019 Conference on Machine Translation (WMT19) , 2019, WMT.
[55] Albert Gatt,et al. Best practices for the human evaluation of automatically generated text , 2019, INLG.
[56] Nanyun Peng,et al. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation , 2020, EMNLP.
[57] Yejin Choi,et al. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning , 2020, EMNLP.
[58] Percy Liang,et al. The price of debiasing automatic metrics in natural language evalaution , 2018, ACL.
[59] Ondrej Bojar,et al. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges , 2019, WMT.
[60] Sameer Singh,et al. Evaluating Question Answering Evaluation , 2019, EMNLP.
[61] Philipp Koehn,et al. Findings of the 2017 Conference on Machine Translation (WMT17) , 2017, WMT.
[62] Karin M. Verspoor,et al. Findings of the 2016 Conference on Machine Translation , 2016, WMT.
[63] Myle Ott,et al. On The Evaluation of Machine Translation Systems Trained With Back-Translation , 2019, ACL.