Jungo Kasai | Yejin Choi | Noah A. Smith | Daniel Khashabi | Gabriel Stanovsky | Daniel S. Weld | Nicholas Lourie | Jonathan Bragg
[1] Asli Celikyilmaz, et al. Evaluation of Text Generation: A Survey, 2020, ArXiv.
[2] Sameer Singh, et al. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics, 2020, EMNLP.
[3] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.
[4] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[5] Deborah A. Coughlin, et al. Correlating automated and human assessments of machine translation quality, 2003, MTSUMMIT.
[6] Mausam, et al. To Re(label), or Not To Re(label), 2014, HCOMP.
[7] Doug Downey, et al. Abductive Commonsense Reasoning, 2019, ICLR.
[8] Myle Ott, et al. On The Evaluation of Machine Translation Systems Trained With Back-Translation, 2019, ACL.
[9] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[10] Ondrej Bojar, et al. Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance, 2018, WMT.
[11] Marta R. Costa-jussà, et al. Findings of the 2019 Conference on Machine Translation (WMT19), 2019, WMT.
[12] Philipp Koehn, et al. Re-evaluating the Role of Bleu in Machine Translation Research, 2006, EACL.
[13] Philipp Koehn, et al. Johns Hopkins University Submission for WMT News Translation Task, 2019, WMT.
[14] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[15] Timothy Baldwin, et al. Is Machine Translation Getting Better over Time?, 2014, EACL.
[16] Philipp Koehn, et al. Findings of the 2017 Conference on Machine Translation (WMT17), 2017, WMT.
[17] Dragomir R. Radev, et al. Generating summaries of multiple news articles, 1995, SIGIR '95.
[18] Ondrej Bojar, et al. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges, 2019, WMT.
[19] Jonathan Berant, et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, 2019, NAACL.
[20] Benjamin Van Durme, et al. Efficient Online Scalar Annotation with Bounded Support, 2018, ACL.
[21] Rajendra Bhatia, et al. A Better Bound on the Variance, 2000, Am. Math. Mon.
[22] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[23] Christopher D. Manning, et al. Get To The Point: Summarization with Pointer-Generator Networks, 2017, ACL.
[24] Sameer Singh, et al. Evaluating Question Answering Evaluation, 2019, EMNLP.
[25] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2015, CVPR.
[26] Philipp Koehn, et al. Findings of the 2018 Conference on Machine Translation (WMT18), 2018, WMT.
[27] Gunhee Kim, et al. Abstractive Summarization of Reddit Posts with Multi-level Memory Networks, 2018, NAACL.
[28] Ido Dagan, et al. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature, 2019, Workshop on Methods for Optimizing and Evaluating Neural Language Generation.
[29] Mirella Lapata, et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.
[30] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[31] Marjan Ghazvininejad, et al. Multilingual Denoising Pre-training for Neural Machine Translation, 2020, TACL.
[32] Philipp Koehn, et al. Findings of the 2020 Conference on Machine Translation (WMT20), 2020, WMT.
[33] Percy Liang, et al. Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.
[34] Shashi Narayan, et al. HighRES: Highlight-based Reference-less Evaluation of Summarization, 2019, ACL.
[35] Matt Post, et al. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.
[36] Lora Aroyo, et al. Metrology for AI: From Benchmarks to Instruments, 2019, ArXiv.
[37] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[38] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[39] Myle Ott, et al. Facebook FAIR’s WMT19 News Translation Task Submission, 2019, WMT.
[40] George R. Doddington, et al. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002.
[41] Lucia Specia, et al. Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale, 2020, COLING.
[42] Jianfeng Gao, et al. An Information-Theoretic Approach to Automatic Evaluation of Summaries, 2006, NAACL.
[43] Michael S. Bernstein, et al. HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models, 2019, NeurIPS.
[44] Ido Dagan, et al. Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation, 2019, NAACL.
[45] Yejin Choi, et al. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning, 2020, EMNLP.
[46] Yejin Choi, et al. Evaluating Machines by their Real-World Language Use, 2020, ArXiv.
[47] Albert Gatt, et al. Best practices for the human evaluation of automatically generated text, 2019, INLG.
[48] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[49] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[50] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[51] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[52] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.
[53] Percy Liang, et al. The price of debiasing automatic metrics in natural language evaluation, 2018, ACL.
[54] Gabriel Stanovsky, et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, 2019, NAACL.
[55] Andy Way, et al. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation, 2018, WMT.
[56] Karin M. Verspoor, et al. Findings of the 2016 Conference on Machine Translation, 2016, WMT.
[57] Timothy Baldwin, et al. Continuous Measurement Scales in Human Evaluation of Machine Translation, 2013, LAW@ACL.
[58] Ani Nenkova, et al. Evaluating Content Selection in Summarization: The Pyramid Method, 2004, NAACL.
[59] Chris Callison-Burch, et al. ChatEval: A Tool for Chatbot Evaluation, 2019, NAACL.
[60] Yao Zhao, et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, 2020, ICML.
[61] Philipp Koehn, et al. Translationese in Machine Translation Evaluation, 2019, EMNLP.