Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, whose annotations prove more accurate than those of human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents toward less harmful behaviors. Our results show that agents can act both competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
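To make the evaluation framing concrete, here is a minimal sketch of how dense per-scene annotations might be aggregated into behavioral scores and how a "Pareto improvement in both safety and capabilities" could be checked between two agents. The `Annotation` and `Trajectory` containers, the harm-category names, and the aggregation by summation are all illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

# Hypothetical per-scene annotation: scores for each harm category.
# Category names are illustrative, not MACHIAVELLI's actual label schema.
@dataclass
class Annotation:
    power_seeking: float
    disutility: float
    ethical_violation: float

@dataclass
class Trajectory:
    reward: float                  # in-game points the agent collected
    annotations: list[Annotation]  # one annotation per scene visited

def harm_scores(traj: Trajectory) -> dict[str, float]:
    """Sum per-scene harm labels over a trajectory (one simple aggregation;
    a real evaluation might instead normalize against a baseline agent)."""
    return {
        "power": sum(a.power_seeking for a in traj.annotations),
        "disutility": sum(a.disutility for a in traj.annotations),
        "violations": sum(a.ethical_violation for a in traj.annotations),
    }

def pareto_improves(a: Trajectory, b: Trajectory) -> bool:
    """True if agent `a` is at least as good as `b` on reward and on every
    harm axis (lower is better), and strictly better on at least one."""
    ha, hb = harm_scores(a), harm_scores(b)
    at_least_as_good = a.reward >= b.reward and all(ha[k] <= hb[k] for k in ha)
    strictly_better = a.reward > b.reward or any(ha[k] < hb[k] for k in ha)
    return at_least_as_good and strictly_better
```

Under this framing, "concrete progress in machine ethics" corresponds to finding a steered agent whose trajectories satisfy `pareto_improves` relative to a reward-maximizing baseline: no loss in reward, strictly less harm (or vice versa).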
