Towards AGI Agent Safety by Iteratively Improving the Utility Function

While it is still unclear whether agents with Artificial General Intelligence (AGI) can ever be built, we can already use mathematical models to investigate potential safety mechanisms for such agents. We present work on an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function. The humans who switched on the agent can use this terminal to close any loopholes discovered in the utility function's encoding of agent goals and constraints, to direct the agent toward new goals, or to force the agent to switch itself off.
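As a toy illustration of the mechanism described above (a minimal sketch under assumed names; the class and function names here are hypothetical and not the paper's formalism), the terminal can be modelled as a dedicated channel through which the principals replace the agent's current utility function or issue a shutdown order, while the agent always optimises against whatever function the terminal currently holds:

```python
# Minimal sketch (hypothetical names): an agent whose utility function can be
# iteratively replaced via a dedicated input terminal, including forced shutdown.
from typing import Callable, Dict, List

State = Dict[str, float]
Utility = Callable[[State], float]

class UtilityTerminal:
    """Dedicated channel through which the principals update the agent's goals."""
    def __init__(self, initial: Utility):
        self.utility = initial
        self.stopped = False

    def upload(self, new_utility: Utility) -> None:
        # Close a loophole or redirect the agent by swapping in a corrected function.
        self.utility = new_utility

    def order_shutdown(self) -> None:
        self.stopped = True

class Agent:
    def __init__(self, terminal: UtilityTerminal):
        self.terminal = terminal

    def choose(self, actions: List[str],
               transition: Callable[[State, str], State], state: State) -> str:
        if self.terminal.stopped:
            return "switch_off"
        # Greedy choice under the *current* utility function held by the terminal.
        return max(actions, key=lambda a: self.terminal.utility(transition(state, a)))

# Usage: patch a reward-hacking loophole discovered after deployment.
terminal = UtilityTerminal(lambda s: s.get("paperclips", 0.0))
agent = Agent(terminal)
actions = ["make_clips", "hack_sensor"]
transition = lambda s, a: {"paperclips": 10.0} if a == "hack_sensor" else {"paperclips": 1.0}

print(agent.choose(actions, transition, {}))  # hack_sensor exploits the flawed utility
terminal.upload(lambda s: s["paperclips"] if s.get("paperclips", 0.0) <= 5.0 else -1.0)
print(agent.choose(actions, transition, {}))  # corrected utility prefers make_clips
terminal.order_shutdown()
print(agent.choose(actions, transition, {}))  # switch_off
```

The sketch deliberately omits the hard part the paper addresses: a real AGI agent would have an incentive to manipulate or disable such a terminal, which is what the safety layer is designed to prevent.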
