BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
Aoife Cahill | Alejandro Jaimes | Kecheng Zhang | Liang Ma | Robert L. Logan IV | Shihao Ran | Joel Tetreault | Shuyang Cao | Di Lu
[1] Liyan Tang, et al. Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors, 2022, ACL.
[2] Or Honovich, et al. TRUE: Re-evaluating Factual Consistency Evaluation, 2022, NAACL.
[3] Alexander R. Fabbri, et al. QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization, 2021, NAACL.
[4] Philippe Laban, et al. SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization, 2021, TACL.
[5] Amy Pu, et al. Learning Compact Metrics for MT, 2021, EMNLP.
[6] Yuexiang Xie, et al. Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation, 2021, EMNLP.
[7] Weizhe Yuan, et al. BARTScore: Evaluating Generated Text as Text Generation, 2021, NeurIPS.
[8] Artidoro Pagnoni, et al. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics, 2021, NAACL.
[9] Or Honovich, et al. Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering, 2021, EMNLP.
[10] Thomas Scialom, et al. QuestEval: Summarization Asks for Fact-based Evaluation, 2021, EMNLP.
[11] Saadia Gabriel, et al. GO FIGURE: A Meta Evaluation of Factuality in Summarization, 2020, Findings of ACL.
[12] Tanya Goyal, et al. Evaluating Factuality in Generation with Dependency-level Entailment, 2020, Findings of EMNLP.
[13] Alexander R. Fabbri, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, TACL.
[14] Pengcheng He, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.
[15] Esin Durmus, et al. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization, 2020, ACL.
[16] Joshua Maynez, et al. On Faithfulness and Factuality in Abstractive Summarization, 2020, ACL.
[17] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[18] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.
[19] Alex Warstadt, et al. BLiMP: The Benchmark of Linguistic Minimal Pairs for English, 2019, TACL.
[20] Mike Lewis, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[21] Wojciech Kryściński, et al. Evaluating the Factual Consistency of Abstractive Text Summarization, 2019, EMNLP.
[22] Wojciech Kryściński, et al. Neural Text Summarization: A Critical Evaluation, 2019, EMNLP.
[23] Tobias Falke, et al. Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference, 2019, ACL.
[24] Tianyi Zhang, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[25] Rebecca Marvin, et al. Targeted Syntactic Evaluation of Language Models, 2018, EMNLP.
[26] Shashi Narayan, et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.
[27] Karl Moritz Hermann, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.
[28] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[29] Kishore Papineni, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.