Locating Cross-Task Sequence Continuation Circuits in Transformers
[1] Ellie Pavlick, et al. Circuit Component Reuse Across Tasks in Transformer Language Models, 2023, ArXiv.
[2] Fazl Barez, et al. Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders, 2023, ArXiv.
[3] J. Steinhardt, et al. Overthinking the Truth: Understanding how Language Models Process False Demonstrations, 2023, ArXiv.
[4] Shay B. Cohen, et al. Neuron to Graph: Interpreting Language Model Neurons at Scale, 2023, ArXiv.
[5] Ioannis Konstas, et al. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark, 2023, ACL.
[6] Michael Hanna, et al. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023, ArXiv.
[7] Augustine N. Mavor-Parker, et al. Towards Automated Circuit Discovery for Mechanistic Interpretability, 2023, ArXiv.
[8] Aryaman Arora, et al. Localizing Model Behavior with Path Patching, 2023, ArXiv.
[9] Arnab Sen Sharma, et al. Mass-Editing Memory in a Transformer, 2022, ICLR.
[10] Tom B. Brown, et al. In-context Learning and Induction Heads, 2022, ArXiv.
[11] Dario Amodei, et al. Toy Models of Superposition, 2022, ArXiv.
[12] Dylan Hadfield-Menell, et al. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks, 2022, 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML).
[13] Dan Hendrycks, et al. X-Risk Analysis for AI Research, 2022, ArXiv.
[14] Sardar Jaf, et al. A Literature Survey of Recent Advances in Chatbots, 2021, Information.
[15] Tom Henighan, et al. Scaling Laws for Transfer, 2021, ArXiv.
[16] Omer Levy, et al. Transformer Feed-Forward Layers Are Key-Value Memories, 2020, EMNLP.
[17] Jessica Taylor, et al. Alignment for Advanced Machine Learning Systems, 2020, Ethics of Artificial Intelligence.
[18] Jacob Andreas, et al. Compositional Explanations of Neurons, 2020, NeurIPS.
[19] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[20] Nick Cammarata, et al. Zoom In: An Introduction to Circuits, 2020.
[21] Alejandro Barredo Arrieta, et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI, 2019, Information Fusion.
[22] Bolei Zhou, et al. Interpreting the Latent Space of GANs for Semantic Face Editing, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Geoffrey E. Hinton, et al. Similarity of Neural Network Representations Revisited, 2019, ICML.
[24] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[25] N. Bostrom. Superintelligence: Paths, Dangers, Strategies, 2014.
[26] John Schulman, et al. Concrete Problems in AI Safety, 2016, ArXiv.
[27] Jason Yosinski, et al. Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks, 2016, ArXiv.
[28] Hod Lipson, et al. Convergent Learning: Do different neural networks learn the same representations?, 2015, FE@NIPS.
[29] 오상록, et al. Reinforcement Learning Using Domain Knowledge, 2001.
[30] Huaiyu Zhu. On Information and Sufficiency, 1997.
[31] Deborah Silver. Feature Visualization, 1994, Scientific Visualization.
[32] Yonatan Belinkov, et al. Investigating Gender Bias in Language Models Using Causal Mediation Analysis, 2020, NeurIPS.
[33] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.