Sparse Autoencoders Find Highly Interpretable Features in Language Models
[1] Dan Hendrycks, et al. An Overview of Catastrophic AI Risks, 2023, ArXiv.
[2] Stella Rose Biderman, et al. LEACE: Perfect linear concept erasure in closed form, 2023, ArXiv.
[3] Augustine N. Mavor-Parker, et al. Towards Automated Circuit Discovery for Mechanistic Interpretability, 2023, ArXiv.
[4] Oskar van der Wal, et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023, ICML.
[5] Graham Bex-Priestley. Gender as Name, 2022, Journal of Ethics and Social Philosophy.
[6] J. Steinhardt, et al. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, 2022, ArXiv.
[7] Dario Amodei, et al. Toy Models of Superposition, 2022, ArXiv.
[8] Richard Ngo. The alignment problem from a deep learning perspective, 2022, ICLR.
[9] M. Lewis, et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022, ArXiv.
[10] Yoav Goldberg, et al. Linear Adversarial Concept Erasure, 2022, ICML.
[11] Christopher Potts, et al. Causal Abstractions of Neural Networks, 2021, NeurIPS.
[12] Isabelle Augenstein, et al. Is Sparse Attention more Interpretable?, 2021, ACL.
[13] Yann LeCun, et al. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2021, DEELIO.
[14] Charles Foster, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, ArXiv.
[15] Nick Cammarata, et al. Zoom In: An Introduction to Circuits, 2020.
[16] Zhihui Zhu, et al. Analysis of the Optimization Landscapes for Overcomplete Representation Learning, 2019, ArXiv.
[17] André F. T. Martins, et al. Adaptively Sparse Transformers, 2019, EMNLP.
[18] Georgios Georgiadis, et al. Accelerating Convolutional Neural Networks via Activation Map Compression, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.
[20] Rajat Raina, et al. Efficient sparse coding algorithms, 2006, NIPS.
[21] Bruno A Olshausen, et al. Sparse coding of sensory inputs, 2004, Current Opinion in Neurobiology.
[22] David J. Field, et al. Sparse coding with an overcomplete basis set: A strategy employed by V1?, 1997, Vision Research.
[23] Kunihiko Fukushima, et al. Cognitron: A self-organizing multilayered neural network, 1975, Biological Cybernetics.
[24] Richard Ngo, et al. The alignment problem from a deep learning perspective, 2022.
[25] Hiroya Inakoshi, et al. Elite BackProp: Training Sparse Interpretable Neurons, 2021, NeSy.
[26] Yonatan Belinkov, et al. Investigating Gender Bias in Language Models Using Causal Mediation Analysis, 2020, NeurIPS.