Model evaluation for extreme risks
P. Christiano | Jack Clark | Y. Bengio | A. Dafoe | Sebastian Farquhar | Mary Phuong | Jess Whittlestone | Ben Garfinkel | Nahema Marchal | S. Avin | Divya Siddarth | Daniel Kokotajlo | Noam Kolt | Been Kim | W. Hawkins | Iason Gabriel | Markus Anderljung | Toby Shevlane | Jade Leung | Lewis Ho | Vijay Bolina
[1] Christopher D. Manning, et al. Holistic Evaluation of Language Models, 2023, Annals of the New York Academy of Sciences.
[2] János Kramár, et al. Power-seeking can be probable and predictive for trained agents, 2023, arXiv.
[3] Dan Hendrycks, et al. Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark, 2023, ICML.
[4] Marco Tulio Ribeiro, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023, arXiv.
[5] Tantum Collins, et al. Exploring the Relevance of Data Privacy-Enhancing Technologies for AI Governance Use Cases, 2023, arXiv.
[6] Henrique Pondé de Oliveira Pinto, et al. GPT-4 Technical Report, 2023, arXiv:2303.08774.
[7] A. Dragan, et al. Automatically Auditing Large Language Models via Discrete Optimization, 2023, ICML.
[8] Dmitrii Krasheninnikov, et al. Harms from Increasingly Agentic Algorithmic Systems, 2023, FAccT.
[9] Hannah Rose Kirk, et al. Auditing large language models: a three-layered approach, 2023, SSRN Electronic Journal.
[10] J. Steinhardt, et al. Progress measures for grokking via mechanistic interpretability, 2023, ICLR.
[11] Tom B. Brown, et al. Discovering Language Model Behaviors with Model-Written Evaluations, 2022, ACL.
[12] D. Klein, et al. Discovering Latent Knowledge in Language Models Without Supervision, 2022, ICLR.
[13] Quoc V. Le, et al. Inverse scaling can become U-shaped, 2022, EMNLP.
[14] J. Schulman, et al. Scaling Laws for Reward Model Overoptimization, 2022, ICML.
[15] Joshua A. Kroll, et al. From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks for Responsible ML, 2022, CHI.
[16] Rohin Shah, et al. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, 2022, arXiv.
[17] Lisa Anne Hendricks, et al. Improving alignment of dialogue agents via targeted human judgements, 2022, arXiv.
[18] Richard Ngo. The alignment problem from a deep learning perspective, 2022, ICLR.
[19] Richard Yuanzhe Pang, et al. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey, 2022, ACL.
[20] Jonathan G. Richens, et al. Discovering Agents, 2022, Artif. Intell.
[21] Joshua Achiam, et al. A Hazard Analysis Framework for Code Synthesis Large Language Models, 2022, arXiv.
[22] Inioluwa Deborah Raji, et al. The Fallacy of AI Functionality, 2022, FAccT.
[23] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk?, 2022, arXiv.
[24] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, Trans. Mach. Learn. Res.
[25] Inioluwa Deborah Raji, et al. Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance, 2022, AIES.
[26] Daniel M. Ziegler, et al. Adversarial Training for High-Stakes Reliability, 2022, NeurIPS.
[27] Tom B. Brown, et al. Predictability and Surprise in Large Generative Models, 2022, FAccT.
[28] Geoffrey Irving, et al. Red Teaming Language Models with Language Models, 2022, EMNLP.
[29] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[30] Jess Whittlestone, et al. Why and How Governments Should Monitor AI Development, 2021, arXiv.
[31] Pedro A. Ortega, et al. Agent Incentives: A Causal Perspective, 2021, AAAI.
[32] Peter Henderson, et al. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims, 2020, arXiv.
[33] Nick Cammarata, et al. Zoom In: An Introduction to Circuits, 2020.
[34] Inioluwa Deborah Raji, et al. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing, 2020, FAT*.
[35] Rohin Shah, et al. Optimal Policies Tend To Seek Power, 2019, NeurIPS.
[36] Inioluwa Deborah Raji, et al. Model Cards for Model Reporting, 2018, FAT.
[37] Hyrum S. Anderson, et al. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018, arXiv.
[38] Anca D. Dragan, et al. The Off-Switch Game, 2016, IJCAI.
[39] Laurent Orseau, et al. Safely Interruptible Agents, 2016, UAI.