SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements, which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose "SelfCheckGPT", a simple sampling-based approach that can be used to fact-check black-box models in a zero-resource fashion, i.e., without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. For hallucinated facts, however, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that in sentence hallucination detection, our approach has AUC-PR scores comparable to or better than grey-box methods, while SelfCheckGPT is best at passage factuality assessment.
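The sampling-consistency idea can be illustrated with a small sketch. This is not the scoring method from the paper: it swaps in a crude lexical-overlap proxy purely for illustration, and the function names (`tokenize`, `inconsistency_score`, `passage_score`) are hypothetical. It assumes the main response has already been split into sentences and that several additional responses to the same prompt have been sampled stochastically from the same black-box model.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def inconsistency_score(sentence: str, samples: List[str]) -> float:
    """Toy proxy: 1 minus the mean fraction of the sentence's tokens
    that also appear in each sampled response.

    A score near 0 means the samples agree with the sentence; a score
    near 1 means the samples rarely support it (possible hallucination).
    """
    sent_tokens = tokenize(sentence)
    if not sent_tokens or not samples:
        return 0.0  # assumption: nothing to compare means no evidence of inconsistency
    support = [len(sent_tokens & tokenize(s)) / len(sent_tokens) for s in samples]
    return 1.0 - sum(support) / len(support)


def passage_score(sentences: List[str], samples: List[str]) -> float:
    """Average the sentence-level scores to rank whole passages."""
    scores = [inconsistency_score(s, samples) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical sampled responses, made up for this illustration.
    samples = [
        "He was a physician who worked in London.",
        "He practised medicine and wrote several books.",
    ]
    for sent in ["He was a physician.", "He won the Nobel Prize in 1950."]:
        print(sent, round(inconsistency_score(sent, samples), 3))
```

Token overlap cannot detect paraphrase or contradiction, so a practical system would replace `inconsistency_score` with a stronger semantic comparison; the surrounding structure (score each sentence against the samples, then aggregate per passage) is the part that mirrors the two uses described above, sentence-level detection and passage-level ranking.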
