Evaluating the Factual Consistency of Large Language Models Through Summarization

While large language models (LLMs) have proven effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. For factually inconsistent summaries, we use summaries generated by a suite of summarization models that we manually annotate as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents for which it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families, including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to factually consistent summaries. We also validate design choices in our benchmark, including the scoring method and the source of distractor summaries.
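
The evaluation described above can be illustrated with a minimal sketch. The snippet below assumes a HuggingFace causal LM ("gpt2" is used only as a placeholder; the benchmark evaluates much larger models) and scores each summary by its length-normalized log-likelihood conditioned on the document. This scoring function and the prompt format are illustrative assumptions, not necessarily the exact choices validated in the benchmark.

```python
# Minimal sketch of a FIB-style pairwise evaluation.
# Assumptions: a HuggingFace causal LM, a simple "Document/Summary" prompt, and
# length-normalized log-likelihood as the summary score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in for the evaluated LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def summary_score(document: str, summary: str) -> float:
    """Average per-token log-likelihood of `summary` given `document` as the prefix."""
    prefix_ids = tokenizer(f"Document: {document}\nSummary:", return_tensors="pt").input_ids
    summary_ids = tokenizer(" " + summary, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, summary_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token, predicted from the preceding position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    # Keep only the positions that predict summary tokens.
    summary_positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_lls = [log_probs[pos, targets[pos]].item() for pos in summary_positions]
    return sum(token_lls) / len(token_lls)


def fib_accuracy(examples):
    """Fraction of (document, consistent, inconsistent) triples where the
    factually consistent summary receives the higher score."""
    wins = sum(
        summary_score(doc, consistent) > summary_score(doc, inconsistent)
        for doc, consistent, inconsistent in examples
    )
    return wins / len(examples)
```

Under this setup, a higher accuracy indicates that the model more reliably prefers the factually consistent summary of the pair; alternative scoring functions (e.g., pointwise mutual information with the document) can be swapped into `summary_score` without changing the accuracy computation.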
