Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, along a learned set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: whereas approaches such as RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
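To make the mechanism concrete, the sketch below shows an ITI-style activation shift implemented as a PyTorch forward hook. It is illustrative only: the toy module, `probe_direction`, `sigma`, and `alpha` values are assumptions, and the paper applies the shift to individual attention heads (selected by probe accuracy) rather than to a whole projection as done here.

```python
import torch
import torch.nn as nn

def make_iti_hook(direction: torch.Tensor, alpha: float, sigma: float):
    """Forward hook that shifts a module's output along a fixed direction.

    In ITI, `direction` is identified per attention head by probing
    activations on truthful vs. untruthful answers, `sigma` is the standard
    deviation of activations along that direction, and `alpha` controls
    intervention strength (the truthfulness/helpfulness knob).
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the module output.
        return output + alpha * sigma * unit.to(output.dtype)

    return hook

# Usage sketch on a stand-in "attention output projection". Real ITI hooks
# selected head outputs inside chosen transformer layers of LLaMA/Alpaca.
hidden = 64
proj = nn.Linear(hidden, hidden)
probe_direction = torch.randn(hidden)  # stand-in for a learned probe direction
handle = proj.register_forward_hook(
    make_iti_hook(probe_direction, alpha=15.0, sigma=1.0))

x = torch.randn(1, 8, hidden)  # (batch, seq, hidden)
shifted = proj(x)              # output now includes the truthful shift
handle.remove()                # intervention is trivially reversible
```

A hook-based implementation leaves the model weights untouched, which is what makes the intervention minimally invasive and easy to toggle or tune at inference time.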
