Categorizing Wireheading in Partially Embedded Agents

$\textit{Embedded agents}$ are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they may be incentivized to shortcut or $\textit{wirehead}$ in order to achieve maximal reward. In this paper, we provide a taxonomy of the ways in which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading within the broader class of all misalignment problems, in which the goals of the agent conflict with the goals of the human designer, and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.
