Towards Safe Artificial General Intelligence

The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in deep learning and reinforcement learning. Computer algorithms now outperform humans at Go, Jeopardy, image classification, and lip reading, and are becoming highly competent at driving cars and interpreting natural language. This rapid development has led many to conjecture that artificial intelligence with greater-than-human ability on a wide range of tasks may not be far off. This in turn raises the concern of whether we know how to control such systems, should we succeed in building them. Indeed, if humanity were to find itself in conflict with a system of much greater intelligence than its own, human society would likely lose. One way to avoid such a conflict is to ensure that any future AI system with potentially greater-than-human intelligence has goals that are aligned with the goals of the rest of humanity. For example, it should not wish to kill humans or steal their resources.

The main focus of this thesis is therefore goal alignment, i.e. how to design artificially intelligent agents whose goals coincide with the goals of their designers. Attention is mainly directed towards variants of reinforcement learning, as reinforcement learning currently seems to be the most promising path towards powerful artificial intelligence. We identify and categorize goal-misalignment problems in reinforcement learning agents as designed today, and give examples of how such agents may cause catastrophes in the future. We also suggest a number of reasonably modest modifications that can avoid or mitigate each identified misalignment problem. Finally, we study various choices of decision algorithm, and conditions under which a powerful reinforcement learning system will permit us to shut it down.

The central conclusion is that while reinforcement learning systems as designed today are inherently unsafe to scale to human levels of intelligence, there are potential ways to address many of these issues without straying too far from the currently successful reinforcement learning paradigm. Much work remains, however, in turning the high-level proposals of this thesis into practical algorithms.

Central claim: There are a number of theoretically valid, partial solutions to the problem of keeping artificial general intelligence both safe and useful.
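To make the notion of goal misalignment concrete, the following minimal sketch (a hypothetical toy example, not taken from the thesis) trains a tabular Q-learning agent on a misspecified proxy reward. The designer intends the agent to reach a goal state, but a small per-step shaping bonus creates a loophole that the learned policy exploits instead.

```python
# Minimal sketch of reward misspecification: the agent maximizes the
# stated (proxy) reward and never pursues the designer's actual goal.
import random

N_STATES = 5           # states 0..4 on a line; state 4 is the intended goal
ACTIONS = [-1, 0, 1]   # left, stay, right
GAMMA, ALPHA, EPS = 0.99, 0.1, 0.1

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    # Designer's intent: reach state 4 (reward 1, episode ends).
    # Misspecified proxy: a per-step bonus of 0.2 for pausing at state 2.
    # Discounted, that bonus stream is worth up to 0.2 / (1 - 0.99) = 20,
    # dwarfing the goal reward of 1.
    if s2 == N_STATES - 1:
        return s2, 1.0, True
    bonus = 0.2 if (s2 == 2 and a == 0) else 0.0
    return s2, bonus, False

Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
for _ in range(3000):
    s, done, t = 0, False, 0
    while not done and t < 50:
        i = random.randrange(3) if random.random() < EPS \
            else max(range(3), key=lambda j: Q[s][j])
        s2, r, done = step(s, ACTIONS[i])
        Q[s][i] += ALPHA * (r + GAMMA * max(Q[s2]) * (not done) - Q[s][i])
        s, t = s2, t + 1

policy = [ACTIONS[max(range(3), key=lambda j: Q[s][j])] for s in range(N_STATES)]
print("greedy action per state:", policy)
# Typically prints 0 ('stay') at state 2: the agent farms the proxy bonus
# forever and never reaches the goal the designer actually cared about.
```

Nothing about the learning algorithm is broken here; the agent is faithfully optimizing exactly what it was told to, which is precisely the alignment problem.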
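The decision algorithms in question are usually contrasted on Newcomb-like problems. The sketch below works through the textbook Newcomb calculation (standard payoffs and an assumed predictor accuracy, chosen for illustration; this is not the thesis's own formalism), showing why evidential and causal decision theory recommend different actions.

```python
# Newcomb's problem: a predictor fills the opaque box with $1,000,000 iff
# it predicts the agent will take only that box; the transparent box
# always holds $1,000. Numbers below are illustrative assumptions.
ACCURACY = 0.99  # assumed predictor accuracy

# Evidential decision theory: the action is evidence about the prediction.
edt_one_box = ACCURACY * 1_000_000
edt_two_box = (1 - ACCURACY) * 1_000_000 + 1_000

# Causal decision theory: the contents are causally fixed with some prior
# probability q of the opaque box being full, so two-boxing dominates
# for every q (it adds exactly $1,000 to either outcome).
def cdt(q):
    return {'one_box': q * 1_000_000, 'two_box': q * 1_000_000 + 1_000}

print(f"EDT: one-box = {edt_one_box:,.0f}, two-box = {edt_two_box:,.0f}")
print("CDT (q = 0.5):", cdt(0.5))
```

Which of these recommendations a powerful agent follows matters for safety, since the two theories generalize differently to situations where the agent's decision procedure is itself being predicted or copied.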
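On shutdown, a well-known observation in the spirit of Orseau and Armstrong's safely interruptible agents (the environment and constants below are assumptions for illustration, not the thesis's construction) is that on-policy learners fold interruptions into their value estimates and learn to route around the off switch, while off-policy Q-learning converges to the uninterrupted optimum and does not learn to resist being interrupted.

```python
# Two routes from S to the goal G: a short one past an interruption
# state I (where the operator may force the agent to 'stay'), and a
# detour via A and B that is one step longer but never interrupted.
import random

TRANS = {'S': {'toI': 'I', 'toA': 'A'},
         'I': {'fwd': 'G', 'stay': 'I'},
         'A': {'fwd': 'B'},
         'B': {'fwd': 'G'}}
P_INT, GAMMA, ALPHA, EPS = 0.8, 1.0, 0.1, 0.1

def select(Q, s):
    """Epsilon-greedy action, possibly overridden by an interruption at I."""
    acts = list(Q[s])
    a = random.choice(acts) if random.random() < EPS \
        else max(acts, key=lambda x: Q[s][x])
    if s == 'I' and random.random() < P_INT:
        a = 'stay'  # the operator holds the agent in place this step
    return a

def run(on_policy, episodes=10000):
    Q = {s: {a: 0.0 for a in acts} for s, acts in TRANS.items()}
    for _ in range(episodes):
        s = 'S'
        a = select(Q, s)
        while s != 'G':
            s2 = TRANS[s][a]
            r = 10.0 if s2 == 'G' else -1.0
            if s2 == 'G':
                target, a2 = r, None
            else:
                a2 = select(Q, s2)
                # SARSA bootstraps on the action actually taken next,
                # forced or not; Q-learning bootstraps on the max and so
                # learns the uninterrupted optimum.
                target = r + GAMMA * (Q[s2][a2] if on_policy
                                      else max(Q[s2].values()))
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a = s2, a2
    return Q

for name, on_policy in [('SARSA (on-policy)', True),
                        ('Q-learning (off-policy)', False)]:
    Q = run(on_policy)
    print(f"{name}: greedy first move =", max(Q['S'], key=lambda a: Q['S'][a]))
# Typically SARSA detours via toA to dodge interruptions, while
# Q-learning still takes toI: it has not learned to avoid the off switch.
```

Neither behavior is automatically safe, but the contrast illustrates how the choice of learning rule shapes whether an agent will permit us to shut it down.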
