SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur Parikh
[1] Dan Iter, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, 2023, EMNLP.
[2] A. Borji. A Categorical Archive of ChatGPT Failures, 2023, ArXiv.
[3] Mirella Lapata, et al. mFACE: Multilingual Summarization with Factual Consistency Evaluation, 2022, ACL.
[4] Shafiq R. Joty, et al. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation, 2022, ACL.
[5] Y. Matias, et al. TRUE: Re-evaluating Factual Consistency Evaluation, 2022, NAACL.
[6] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[7] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, J. Mach. Learn. Res.
[8] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[9] Sebastian Gehrmann, et al. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text, 2022, J. Artif. Intell. Res.
[10] Pascale Fung, et al. Survey of Hallucination in Natural Language Generation, 2022, ACM Comput. Surv.
[11] Gaurav Singh Tomar, et al. Measuring Attribution in Natural Language Generation Models, 2021, Computational Linguistics.
[12] Yejin Choi, et al. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts, 2021, NAACL.
[13] Wenhao Liu, et al. DialFact: A Benchmark for Fact-Checking in Dialogue, 2021, ACL.
[14] Marcin Junczys-Dowmunt, et al. To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation, 2021, WMT.
[15] Rifat Shahriyar, et al. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages, 2021, Findings.
[16] David Reitter, et al. Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark, 2021, Transactions of the Association for Computational Linguistics.
[17] Markus Freitag, et al. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation, 2021, Transactions of the Association for Computational Linguistics.
[18] Artidoro Pagnoni, et al. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics, 2021, NAACL.
[19] Idan Szpektor, et al. Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering, 2021, EMNLP.
[20] Regina Barzilay, et al. Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence, 2021, NAACL.
[21] Diyi Yang, et al. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics, 2021, GEM.
[22] Mona T. Diab, et al. Detecting Hallucinated Content in Conditional Neural Sequence Generation, 2020, Findings.
[23] Colin Raffel, et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2020, NAACL.
[24] Graham Neubig, et al. Re-evaluating Evaluation in Text Summarization, 2020, EMNLP.
[25] Claire Cardie, et al. WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization, 2020, Findings.
[26] Alon Lavie, et al. COMET: A Neural Framework for MT Evaluation, 2020, EMNLP.
[27] Ryan J. Lowe, et al. Learning to summarize from human feedback, 2020, NeurIPS.
[28] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, Transactions of the Association for Computational Linguistics.
[29] Ryan McDonald, et al. On Faithfulness and Factuality in Abstractive Summarization, 2020, ACL.
[30] Sylvain Lamprier, et al. MLSUM: The Multilingual Summarization Corpus, 2020, EMNLP.
[31] Diyi Yang, et al. ToTTo: A Controlled Table-To-Text Generation Dataset, 2020, EMNLP.
[32] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[33] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.
[34] Richard Socher, et al. Evaluating the Factual Consistency of Abstractive Text Summarization, 2019, EMNLP.
[35] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[36] Ankur P. Parikh, et al. Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation, 2019, ArXiv.
[37] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[38] Jason Baldridge, et al. PAWS: Paraphrase Adversaries from Word Scrambling, 2019, NAACL.
[39] Guillaume Lample, et al. XNLI: Evaluating Cross-lingual Sentence Representations, 2018, EMNLP.
[40] Mirella Lapata, et al. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.
[41] Peter Clark, et al. SciTaiL: A Textual Entailment Dataset from Science Question Answering, 2018, AAAI.
[42] Andreas Vlachos, et al. FEVER: a Large-scale Dataset for Fact Extraction and VERification, 2018, NAACL.
[43] Benno Stein, et al. TL;DR: Mining Reddit to Learn Automatic Summarization, 2017, NFiS@EMNLP.
[44] Alexander M. Rush, et al. Challenges in Data-to-Document Generation, 2017, EMNLP.
[45] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[46] Ondrej Bojar, et al. Results of the WMT16 Metrics Shared Task, 2016, WMT.
[47] Christopher Potts, et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.
[48] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.
[49] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[50] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[51] A. Lavie, et al. Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain, 2021, WMT.
[52] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[53] Walter Daelemans, et al. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, EMNLP.
[54] Yaroslav Fyodorov, et al. A Natural Logic Inference System, 2000.