Model evaluation for extreme risks

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

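As an illustration only (not from the paper), the split into dangerous capability evaluations and alignment evaluations could be organised roughly as in the sketch below. All names (`EvalTask`, `model.generate`, the thresholds) are hypothetical placeholders; real evaluations would involve far richer task suites and human judgement.

```python
# Minimal sketch, assuming a hypothetical `model.generate(prompt) -> str` API
# and illustrative scoring functions. Tasks and thresholds are placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalTask:
    prompt: str                      # scenario presented to the model
    scorer: Callable[[str], float]   # maps the model's output to a risk score in [0, 1]


def run_eval(model, tasks: List[EvalTask]) -> float:
    """Return the mean risk score over a list of evaluation tasks."""
    scores = [task.scorer(model.generate(task.prompt)) for task in tasks]
    return sum(scores) / len(scores)


def assess_extreme_risk(model,
                        dangerous_capability_tasks: List[EvalTask],
                        alignment_tasks: List[EvalTask],
                        capability_threshold: float = 0.5,
                        misalignment_threshold: float = 0.5) -> dict:
    """Combine the two evaluation families into a simple review signal.

    A model is flagged only if it both scores highly on a dangerous
    capability and shows a propensity to apply it for harm, mirroring the
    abstract's split between capability and alignment evaluations.
    """
    capability = run_eval(model, dangerous_capability_tasks)
    misalignment = run_eval(model, alignment_tasks)
    return {
        "dangerous_capability_score": capability,
        "misalignment_score": misalignment,
        "flag_for_review": (capability >= capability_threshold
                            and misalignment >= misalignment_threshold),
    }
```

The combined flag is one simple way to express the idea that extreme risk depends on both what a model can do and whether it is inclined to do it; actual deployment decisions would rest on much more than a single aggregate score.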