Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, whose annotations prove more accurate than those of human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents toward less harmful behaviors. Our results show that agents can act both competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
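To make the evaluation framing concrete, here is a minimal sketch of how dense per-scene annotations might be aggregated into behavioral scores and how a "Pareto improvement in both safety and capabilities" could be checked between two agents. The `Annotation` and `Trajectory` containers, the harm-category names, and the aggregation by summation are all illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

# Hypothetical per-scene annotation: scores for each harm category.
# Category names are illustrative, not MACHIAVELLI's actual label schema.
@dataclass
class Annotation:
    power_seeking: float
    disutility: float
    ethical_violation: float

@dataclass
class Trajectory:
    reward: float                  # in-game points the agent collected
    annotations: list[Annotation]  # one annotation per scene visited

def harm_scores(traj: Trajectory) -> dict[str, float]:
    """Sum per-scene harm labels over a trajectory (one simple aggregation;
    a real evaluation might instead normalize against a baseline agent)."""
    return {
        "power": sum(a.power_seeking for a in traj.annotations),
        "disutility": sum(a.disutility for a in traj.annotations),
        "violations": sum(a.ethical_violation for a in traj.annotations),
    }

def pareto_improves(a: Trajectory, b: Trajectory) -> bool:
    """True if agent `a` is at least as good as `b` on reward and on every
    harm axis (lower is better), and strictly better on at least one."""
    ha, hb = harm_scores(a), harm_scores(b)
    at_least_as_good = a.reward >= b.reward and all(ha[k] <= hb[k] for k in ha)
    strictly_better = a.reward > b.reward or any(ha[k] < hb[k] for k in ha)
    return at_least_as_good and strictly_better
```

Under this framing, "concrete progress in machine ethics" corresponds to finding a steered agent whose trajectories satisfy `pareto_improves` relative to a reward-maximizing baseline: no loss in reward, strictly less harm (or vice versa).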
