Towards AGI Agent Safety by Iteratively Improving the Utility Function

While it is still unclear whether agents with Artificial General Intelligence (AGI) can ever be built, we can already use mathematical models to investigate potential safety mechanisms for such agents. We present work on an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function. The humans who switched on the agent can use this terminal to close any loopholes discovered in the utility function's encoding of agent goals and constraints, to direct the agent toward new goals, or to force the agent to switch itself off.
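As a toy illustration of the mechanism described above (a minimal sketch under assumed names; the class and function names here are hypothetical and not the paper's formalism), the terminal can be modelled as a dedicated channel through which the principals replace the agent's current utility function or issue a shutdown order, while the agent always optimises against whatever function the terminal currently holds:

```python
# Minimal sketch (hypothetical names): an agent whose utility function can be
# iteratively replaced via a dedicated input terminal, including forced shutdown.
from typing import Callable, Dict, List

State = Dict[str, float]
Utility = Callable[[State], float]

class UtilityTerminal:
    """Dedicated channel through which the principals update the agent's goals."""
    def __init__(self, initial: Utility):
        self.utility = initial
        self.stopped = False

    def upload(self, new_utility: Utility) -> None:
        # Close a loophole or redirect the agent by swapping in a corrected function.
        self.utility = new_utility

    def order_shutdown(self) -> None:
        self.stopped = True

class Agent:
    def __init__(self, terminal: UtilityTerminal):
        self.terminal = terminal

    def choose(self, actions: List[str],
               transition: Callable[[State, str], State], state: State) -> str:
        if self.terminal.stopped:
            return "switch_off"
        # Greedy choice under the *current* utility function held by the terminal.
        return max(actions, key=lambda a: self.terminal.utility(transition(state, a)))

# Usage: patch a reward-hacking loophole discovered after deployment.
terminal = UtilityTerminal(lambda s: s.get("paperclips", 0.0))
agent = Agent(terminal)
actions = ["make_clips", "hack_sensor"]
transition = lambda s, a: {"paperclips": 10.0} if a == "hack_sensor" else {"paperclips": 1.0}

print(agent.choose(actions, transition, {}))  # hack_sensor exploits the flawed utility
terminal.upload(lambda s: s["paperclips"] if s.get("paperclips", 0.0) <= 5.0 else -1.0)
print(agent.choose(actions, transition, {}))  # corrected utility prefers make_clips
terminal.order_shutdown()
print(agent.choose(actions, transition, {}))  # switch_off
```

The sketch deliberately omits the hard part the paper addresses: a real AGI agent would have an incentive to manipulate or disable such a terminal, which is what the safety layer is designed to prevent.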
