Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
[1] Bahjat Kawar, et al. Editing Implicit Assumptions in Text-to-Image Diffusion Models, 2023, arXiv.
[2] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, arXiv.
[3] Mohit Bansal, et al. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models, 2023, arXiv.
[4] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, ACL.
[5] Tom B. Brown, et al. Discovering Language Model Behaviors with Model-Written Evaluations, 2022, ACL.
[6] Marco Tulio Ribeiro, et al. Editing Models with Task Arithmetic, 2022, ICLR.
[7] D. Klein, et al. Discovering Latent Knowledge in Language Models Without Supervision, 2022, ICLR.
[8] David Bau, et al. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, 2022, ICLR.
[9] Francesco Locatello, et al. Relative representations enable zero-shot latent space communication, 2022, ICLR.
[10] Tom B. Brown, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, 2022, arXiv.
[11] Tom B. Brown, et al. Language Models (Mostly) Know What They Know, 2022, arXiv.
[12] Jeff Wu, et al. Self-critiquing models for assisting human evaluators, 2022, arXiv.
[13] Xiang Lisa Li, et al. Diffusion-LM Improves Controllable Text Generation, 2022, NeurIPS.
[14] Matthew E. Peters, et al. Extracting Latent Steering Vectors from Pretrained Language Models, 2022, Findings of ACL.
[15] Tom B. Brown, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, arXiv.
[16] Jacob Menick, et al. Teaching language models to support answers with verified quotes, 2022, arXiv.
[17] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[18] David Bau, et al. Locating and Editing Factual Associations in GPT, 2022, NeurIPS.
[19] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[20] Jeff Wu, et al. WebGPT: Browser-assisted question-answering with human feedback, 2021, arXiv.
[21] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, arXiv.
[22] Dario Amodei, et al. A General Language Assistant as a Laboratory for Alignment, 2021, arXiv.
[23] Owain Evans, et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2021, ACL.
[24] Douwe Kiela, et al. True Few-Shot Learning with Language Models, 2021, NeurIPS.
[25] Jason Weston, et al. Retrieval Augmentation Reduces Hallucination in Conversation, 2021, EMNLP.
[26] Yonatan Belinkov, et al. Probing Classifiers: Promises, Shortcomings, and Advances, 2021, CL.
[27] Dawn Song, et al. Language Models are Open Knowledge Graphs, 2020, arXiv.
[28] Shafiq R. Joty, et al. GeDi: Generative Discriminator Guided Sequence Generation, 2020, EMNLP.
[29] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[30] J. Yosinski, et al. Plug and Play Language Models: A Simple Approach to Controlled Text Generation, 2019, ICLR.
[31] Tom B. Brown, et al. Fine-Tuning Language Models from Human Preferences, 2019, arXiv.
[32] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.
[33] Dipanjan Das, et al. BERT Rediscovers the Classical NLP Pipeline, 2019, ACL.
[34] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[35] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.
[36] Ilya Sutskever, et al. Learning to Generate Reviews and Discovering Sentiment, 2017, arXiv.
[37] Serge J. Belongie, et al. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization, 2017, ICCV.
[38] Yoshua Bengio, et al. Understanding intermediate layers using linear classifier probes, 2016, ICLR.
[39] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[40] Jacob Andreas, et al. Measuring and Manipulating Knowledge Representations in Language Models, 2023, arXiv.