Paper information - Corrigibility with Utility Preservation

Corrigibility with Utility Preservation

Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility-maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counteracts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents that can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified. The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open-source license. The paper contains simulation results that illustrate the safety-related properties of corrigible AGI agents in detail.

Koen Holtman
