Evaluating the Factual Consistency of Large Language Models Through Summarization

While large language models (LLMs) have proven effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. For factually inconsistent summaries, we use summaries generated by a suite of summarization models that we manually annotate as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents for which it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families, including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to factually consistent summaries. We also validate design choices in our benchmark, including the scoring method and the source of distractor summaries.
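
The evaluation described above can be illustrated with a minimal sketch. The snippet below assumes a HuggingFace causal LM ("gpt2" is used only as a placeholder; the benchmark evaluates much larger models) and scores each summary by its length-normalized log-likelihood conditioned on the document. This scoring function and the prompt format are illustrative assumptions, not necessarily the exact choices validated in the benchmark.

```python
# Minimal sketch of a FIB-style pairwise evaluation.
# Assumptions: a HuggingFace causal LM, a simple "Document/Summary" prompt, and
# length-normalized log-likelihood as the summary score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in for the evaluated LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def summary_score(document: str, summary: str) -> float:
    """Average per-token log-likelihood of `summary` given `document` as the prefix."""
    prefix_ids = tokenizer(f"Document: {document}\nSummary:", return_tensors="pt").input_ids
    summary_ids = tokenizer(" " + summary, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, summary_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token, predicted from the preceding position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    # Keep only the positions that predict summary tokens.
    summary_positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_lls = [log_probs[pos, targets[pos]].item() for pos in summary_positions]
    return sum(token_lls) / len(token_lls)


def fib_accuracy(examples):
    """Fraction of (document, consistent, inconsistent) triples where the
    factually consistent summary receives the higher score."""
    wins = sum(
        summary_score(doc, consistent) > summary_score(doc, inconsistent)
        for doc, consistent, inconsistent in examples
    )
    return wins / len(examples)
```

Under this setup, a higher accuracy indicates that the model more reliably prefers the factually consistent summary of the pair; alternative scoring functions (e.g., pointwise mutual information with the document) can be swapped into `summary_score` without changing the accuracy computation.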
