Towards Automated Circuit Discovery for Mechanistic Interpretability
[1] Neel Nanda, et al. Copy Suppression: Comprehensively Understanding an Attention Head, 2023, ArXiv.
[2] Weiping Wang, et al. A Survey on Model Compression for Large Language Models, 2023, ArXiv.
[3] Dan Hendrycks, et al. An Overview of Catastrophic AI Risks, 2023, ArXiv.
[4] Noah D. Goodman, et al. Interpretability at Scale: Identifying Causal Mechanisms in Alpaca, 2023, ArXiv.
[5] D. Bertsimas, et al. Finding Neurons in a Haystack: Case Studies with Sparse Probing, 2023, Trans. Mach. Learn. Res.
[6] Michael Hanna, et al. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023, ArXiv.
[7] Aryaman Arora, et al. Localizing Model Behavior with Path Patching, 2023, ArXiv.
[8] Henrique Pondé de Oliveira Pinto, et al. GPT-4 Technical Report, 2023, ArXiv 2303.08774.
[9] Lawrence Chan, et al. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations, 2023, ICML.
[10] J. Steinhardt, et al. Progress measures for grokking via mechanistic interpretability, 2023, ICLR.
[11] Tom McGrath, et al. Tracr: Compiled Transformers as a Laboratory for Interpretability, 2023, NeurIPS.
[12] Dan Alistarh, et al. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, 2023, ICML.
[13] Khaled Kamal Saab, et al. Hungry Hungry Hippos: Towards Language Modeling with State Space Models, 2022, ICLR.
[14] Ruskin Raj Manku, et al. DeepCuts: Single-Shot Interpretability based Pruning for BERT, 2022, ArXiv.
[15] J. Steinhardt, et al. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, 2022, ArXiv.
[16] Dan Alistarh, et al. GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods, 2022, ArXiv 2210.06384.
[17] Tom B. Brown, et al. In-context Learning and Induction Heads, 2022, ArXiv.
[18] Dario Amodei, et al. Toy Models of Superposition, 2022, ArXiv.
[19] Dylan Hadfield-Menell, et al. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks, 2022, 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML).
[20] Matt J. Kusner, et al. Causal Machine Learning: A Survey and Open Problems, 2022, ArXiv.
[21] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, Trans. Mach. Learn. Res.
[22] Dan Hendrycks, et al. X-Risk Analysis for AI Research, 2022, ArXiv.
[23] Michael W. Mahoney, et al. A Fast Post-Training Pruning Framework for Transformers, 2022, NeurIPS.
[24] Dan Alistarh, et al. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models, 2022, EMNLP.
[25] David Bau, et al. Locating and Editing Factual Associations in GPT, 2022, NeurIPS.
[26] A. Torralba, et al. Natural Language Descriptions of Deep Visual Features, 2022, ICLR.
[27] Noah D. Goodman, et al. Causal Distillation for Language Models, 2021, NAACL.
[28] L. Zappella, et al. Self-conditioning pre-trained language models, 2021, ICML.
[29] Eran Yahav, et al. Thinking Like Transformers, 2021, ICML.
[30] Christopher Potts, et al. Causal Abstractions of Neural Networks, 2021, NeurIPS.
[31] Martin Wattenberg, et al. An Interpretability Illusion for BERT, 2021, ArXiv.
[32] Alexander M. Rush, et al. Low-Complexity Probing via Finding Subnetworks, 2021, NAACL.
[33] Ludwig Schubert, et al. High/Low frequency detectors, 2021, Distill.
[34] Omer Levy, et al. Transformer Feed-Forward Layers Are Key-Value Memories, 2020, EMNLP.
[35] Peter Tiňo, et al. A Survey on Neural Network Interpretability, 2020, IEEE Transactions on Emerging Topics in Computational Intelligence.
[36] Evan Hubinger, et al. An overview of 11 proposals for building safe advanced AI, 2020, ArXiv.
[37] David Bau, et al. Rewriting a Deep Generative Model, 2020, ECCV.
[38] Jacob Andreas, et al. Compositional Explanations of Neurons, 2020, NeurIPS.
[39] Uri Shalit, et al. CausaLM: Causal Model Explanation Through Counterfactual Language Models, 2020, CL.
[40] Alexander M. Rush, et al. Movement Pruning: Adaptive Sparsity by Fine-Tuning, 2020, NeurIPS.
[41] Yonatan Belinkov, et al. Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias, 2020, ArXiv.
[42] Yoav Goldberg, et al. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?, 2020, ACL.
[43] Nick Cammarata, et al. Thread: Circuits, 2020, Distill.
[44] Nick Cammarata, et al. Zoom In: An Introduction to Circuits, 2020, Distill.
[45] Jose Javier Gonzalez Ortiz, et al. What is the State of Neural Network Pruning?, 2020, MLSys.
[46] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[47] Ali Farhadi, et al. What’s Hidden in a Randomly Weighted Neural Network?, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Michael Arens, et al. Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey, 2019, Mach. Learn. Knowl. Extr.
[49] N. C. Chung, et al. Concept Saliency Maps to Visualize Relevant Features in Deep Generative Models, 2019, 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA).
[50] Ziheng Wang, et al. Structured Pruning of Large Language Models, 2019, EMNLP.
[51] Omer Levy, et al. Are Sixteen Heads Really Better than One?, 2019, NeurIPS.
[52] Grzegorz Chrupala, et al. Analyzing and interpreting neural networks for NLP: A report on the first BlackboxNLP workshop, 2019, Natural Language Engineering.
[53] Samuel R. Bowman, et al. Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis, 2018, BlackboxNLP@EMNLP.
[55] Been Kim, et al. Sanity Checks for Saliency Maps, 2018, NeurIPS.
[56] Hyrum S. Anderson, et al. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018, ArXiv.
[57] Diederik P. Kingma, et al. Learning Sparse Neural Networks through L0 Regularization, 2017, ICLR.
[58] Cengiz Öztireli, et al. Towards better understanding of gradient-based attribution methods for Deep Neural Networks, 2017, ICLR.
[59] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[60] Andrea Vedaldi, et al. Interpretable Explanations of Black Boxes by Meaningful Perturbation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[61] Ankur Taly, et al. Axiomatic Attribution for Deep Networks, 2017, ICML.
[62] Been Kim, et al. Towards A Rigorous Science of Interpretable Machine Learning, 2017, ArXiv 1702.08608.
[63] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.
[64] Zachary Chase Lipton. The mythos of model interpretability, 2016, ACM Queue.
[65] Serge J. Belongie, et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks, 2016, NIPS.
[66] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.
[67] Xinlei Chen, et al. Visualizing and Understanding Neural Models in NLP, 2015, NAACL.
[69] Andrew Zisserman, et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2013, ICLR.
[70] Lihi Zelnik-Manor, et al. Context-aware saliency detection, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[71] Tom Fawcett, et al. An introduction to ROC analysis, 2006, Pattern Recognit. Lett.
[72] Deborah Silver, et al. Feature Visualization, 1994, Scientific Visualization.
[73] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.
[75] Dylan Hadfield-Menell, et al. Benchmarking Interpretability Tools for Deep Neural Networks, 2023, ArXiv.
[78] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[79] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[81] Marco Tulio Ribeiro, et al. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, 2016, KDD.
[82] J. Pearl. Causality: Models, Reasoning and Inference, 2000.
[83] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.