Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan