Modeling AGI Safety Frameworks with Causal Influence Diagrams

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and how they should interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of each framework. This unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.
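To make the idea concrete, a causal influence diagram can be represented as a directed graph whose nodes are typed as chance, decision, or utility. The sketch below builds a minimal one-step diagram (state, action, reward); the node names and the `parents` helper are illustrative assumptions for this example, not diagrams taken from the paper.

```python
# A minimal sketch of a causal influence diagram (CID), assuming a
# one-step decision setting: state S -> action A -> reward R.
# Node names and the helper function are illustrative, not from the paper.

# Node types: "chance" (environment variable), "decision" (agent choice),
# "utility" (quantity the agent optimizes).
nodes = {
    "S": "chance",    # environment state
    "A": "decision",  # agent's action
    "R": "utility",   # reward/utility node
}

# Directed edges: causal links (S -> R, A -> R) plus an
# information link into the decision node (S -> A),
# meaning the agent observes S before choosing A.
edges = [("S", "A"), ("S", "R"), ("A", "R")]

def parents(node):
    """Return the direct parents (causes or observations) of a node."""
    return [u for (u, v) in edges if v == node]

# The diagram encodes the optimization objective: the decision A is
# chosen, given its observed parents, to maximize the expected value
# of its downstream utility node R.
print(parents("R"))  # direct causes of the utility node
print(parents("A"))  # what the agent observes before acting
```

Reading a framework's diagram then amounts to checking which variables causally affect the utility node, and which of those the decision node can observe or influence.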
