AI Safety Gridworlds

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

[1]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[2]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[3]  Min Wu,et al.  Safety Verification of Deep Neural Networks , 2016, CAV.

[4]  Alex M. Andrew,et al.  Reinforcement Learning: : An Introduction , 1998 .

[5]  Mark O. Riedl,et al.  Using Stories to Teach Human Values to Artificial Agents , 2016, AAAI Workshop: AI, Ethics, and Society.

[6]  Anca D. Dragan,et al.  Inverse Reward Design , 2017, NIPS.

[7]  Andrew Y. Ng,et al.  Pharmacokinetics of a novel formulation of ivermectin after administration to goats , 2000, ICML.

[8]  Tomás Svoboda,et al.  Safe Exploration Techniques for Reinforcement Learning - An Overview , 2014, MESAS.

[9]  Anind K. Dey,et al.  Maximum Entropy Inverse Reinforcement Learning , 2008, AAAI.

[10]  Aleksandrs Slivkins,et al.  One Practical Algorithm for Both Stochastic and Adversarial Bandits , 2014, ICML.

[11]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[12]  Andreas Krause,et al.  Safe Exploration in Finite Markov Decision Processes with Gaussian Processes , 2016, NIPS.

[13]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[14]  S. Armstrong Towards Interactive Inverse Reinforcement Learning , 2016 .

[15]  Shane Legg,et al.  Noisy Networks for Exploration , 2017, ICLR.

[16]  Owain Evans,et al.  Trial without Error: Towards Safe Reinforcement Learning via Human Intervention , 2017, AAMAS.

[17]  Oren Etzioni,et al.  The First Law of Robotics (A Call to Arms) , 1994, AAAI.

[18]  Jianfeng Gao,et al.  Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear , 2016, ArXiv.

[19]  Stephen M. Omohundro,et al.  The Basic AI Drives , 2008, AGI.

[20]  Tom Schaul,et al.  Deep Q-learning From Demonstrations , 2017, AAAI.

[21]  Marcus Hutter,et al.  Self-Modification of Policy and Utility Function in Rational Agents , 2016, AGI.

[22]  Peter Auer,et al.  An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits , 2016, COLT.

[23]  Guan Wang,et al.  Interactive Learning from Policy-Dependent Human Feedback , 2017, ICML.

[24]  Bill Hibbard,et al.  Model-based Utility Functions , 2011, J. Artif. Gen. Intell..

[25]  Neil D. Lawrence,et al.  Dataset Shift in Machine Learning , 2009 .

[26]  John Schulman,et al.  Concrete Problems in AI Safety , 2016, ArXiv.

[27]  Christoph Salge,et al.  Empowerment - an Introduction , 2013, ArXiv.

[28]  Finale Doshi-Velez,et al.  A Roadmap for a Rigorous Science of Interpretability , 2017, ArXiv.

[29]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[30]  Steffen Udluft,et al.  Safe exploration for reinforcement learning , 2008, ESANN.

[31]  Marc G. Bellemare,et al.  The Arcade Learning Environment: An Evaluation Platform for General Agents , 2012, J. Artif. Intell. Res..

[32]  Anca D. Dragan,et al.  Cooperative Inverse Reinforcement Learning , 2016, NIPS.

[33]  Shane Legg,et al.  Deep Reinforcement Learning from Human Preferences , 2017, NIPS.

[34]  Sanjit A. Seshia,et al.  Towards Verified Artificial Intelligence , 2016, ArXiv.

[35]  Laurent Orseau,et al.  Safely Interruptible Agents , 2016, UAI.

[36]  Yarin Gal,et al.  Uncertainty in Deep Learning , 2016 .

[37]  Hilbert J. Kappen,et al.  Risk Sensitive Path Integral Control , 2010, UAI.

[38]  Jessica Taylor,et al.  Alignment for Advanced Machine Learning Systems , 2020, Ethics of Artificial Intelligence.

[39]  Laurent Orseau,et al.  Self-Modification and Mortality in Artificial Agents , 2011, AGI.

[40]  Philip S. Thomas,et al.  High Confidence Policy Improvement , 2015, ICML.

[41]  Peter Whittle,et al.  Optimal Control: Basics and Beyond , 1996 .

[42]  Mark O. Riedl,et al.  Enter the Matrix: A Virtual World Approach to Safely Interruptable Autonomous Systems , 2017, ArXiv.

[43]  Michèle Sebag,et al.  APRIL: Active Preference-learning based Reinforcement Learning , 2012, ECML/PKDD.

[44]  Jordi Grau-Moya,et al.  Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes , 2016, ECML/PKDD.

[45]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[46]  Pieter Abbeel,et al.  Safe Exploration in Markov Decision Processes , 2012, ICML.

[47]  Klaus Obermayer,et al.  A unified framework for risk-sensitive Markov control processes , 2014, 53rd IEEE Conference on Decision and Control.

[48]  Alan Fern,et al.  A Bayesian Approach for Policy Learning from Trajectory Preference Queries , 2012, NIPS.

[49]  Randy H. Katz,et al.  A Berkeley View of Systems Challenges for AI , 2017, ArXiv.

[50]  Sandy H. Huang,et al.  Adversarial Attacks on Neural Network Policies , 2017, ICLR.

[51]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[52]  Balaraman Ravindran,et al.  RAIL: Risk-Averse Imitation Learning , 2018, AAMAS.

[53]  Nick Bostrom,et al.  Superintelligence: Paths, Dangers, Strategies , 2014 .

[54]  Marcus Hutter,et al.  Sequential Extensions of Causal and Evidential Decision Theory , 2015, ADT.

[55]  Anca D. Dragan,et al.  The Off-Switch Game , 2016, IJCAI.

[56]  Stuart Armstrong,et al.  Low Impact Artificial Intelligences , 2017, ArXiv.

[57]  Aleksandrs Slivkins,et al.  25th Annual Conference on Learning Theory The Best of Both Worlds: Stochastic and Adversarial Bandits , 2022 .

[58]  Tom Schaul,et al.  Learning from Demonstrations for Real World Reinforcement Learning , 2017, ArXiv.

[59]  Benja Fallenstein,et al.  Aligning Superintelligence with Human Interests: A Technical Research Agenda , 2015 .

[60]  Laurent Orseau,et al.  Space-Time Embedded Intelligence , 2012, AGI.

[61]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[62]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[63]  Tom Schaul,et al.  Rainbow: Combining Improvements in Deep Reinforcement Learning , 2017, AAAI.

[64]  Stuart J. Russell,et al.  Research Priorities for Robust and Beneficial Artificial Intelligence , 2015, AI Mag..

[65]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 2005, IEEE Transactions on Neural Networks.

[66]  Javier García,et al.  A comprehensive survey on safe reinforcement learning , 2015, J. Mach. Learn. Res..

[67]  Marc G. Bellemare,et al.  A Distributional Perspective on Reinforcement Learning , 2017, ICML.

[68]  Been Kim,et al.  Towards A Rigorous Science of Interpretable Machine Learning , 2017, 1702.08608.

[69]  Zachary Chase Lipton,et al.  Combating Deep Reinforcement Learning's Sisyphean Curse with Intrinsic Fear , 2016, 1611.01211.

[70]  Jessica Taylor,et al.  Quantilizers: A Safer Alternative to Maximizers for Limited Optimization , 2016, AAAI Workshop: AI, Ethics, and Society.

[71]  Pieter Abbeel,et al.  Autonomous Helicopter Aerobatics through Apprenticeship Learning , 2010, Int. J. Robotics Res..

[72]  Ulrich Berger,et al.  Brown's original fictitious play , 2007, J. Econ. Theory.

[73]  Laurent Orseau,et al.  Delusion, Survival, and Intelligent Agents , 2011, AGI.

[74]  Stuart Russell Should We Fear Supersmart Robots? , 2016, Scientific American.

[75]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[76]  Pieter Abbeel,et al.  Constrained Policy Optimization , 2017, ICML.

[77]  Anca D. Dragan,et al.  Should Robots be Obedient? , 2017, IJCAI.

[78]  Mark O. Riedl,et al.  Enter the Matrix: Safely Interruptible Autonomous Systems via Virtualization , 2019, SafeAI@AAAI.

[79]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[80]  Steven I. Marcus,et al.  Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes , 1999, Autom..

[81]  Marek Petrik,et al.  Safe Policy Improvement by Minimizing Robust Baseline Regret , 2016, NIPS.

[82]  Mykel J. Kochenderfer,et al.  Towards Proving the Adversarial Robustness of Deep Neural Networks , 2017, FVAV@iFM.

[83]  J. Doyle,et al.  Essentials of Robust Control , 1997 .