Risks from Learned Optimization in Advanced Machine Learning Systems

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be (that is, how will it differ from the loss function it was trained under), and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and an overview of topics for future research.
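To make the base-optimizer / mesa-optimizer distinction concrete, below is a minimal, self-contained sketch. It is not from the paper: the names (MesaOptimizerModel, mesa_objective, training_loss) and the toy task are our own illustrative assumptions. The outer loop plays the role of the base optimizer, selecting model parameters by score on the base objective; the learned model itself then performs an internal search over actions at inference time, guided by its own learned score function.

```python
# Illustrative sketch only: a toy base optimizer training a model that is
# itself an optimizer. All names and the task are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def base_objective(action, target):
    # The loss the training process actually minimizes (squared error).
    return (action - target) ** 2

class MesaOptimizerModel:
    # A learned model that is itself an optimizer: at inference time it
    # searches candidate actions for the one its internal, learned score
    # function (its mesa-objective) ranks highest.
    def __init__(self, theta):
        self.theta = theta  # learned parameter defining the mesa-objective

    def mesa_objective(self, action, observation):
        # Internal score; nothing guarantees it matches the base objective.
        return -(action - self.theta * observation) ** 2

    def act(self, observation, candidates):
        return max(candidates, key=lambda a: self.mesa_objective(a, observation))

def training_loss(theta, n=50):
    # Average base loss of the model induced by theta on sampled inputs,
    # where the base objective rewards the mapping target = 2 * observation.
    model = MesaOptimizerModel(theta)
    candidates = np.linspace(-3.0, 3.0, 121)
    total = 0.0
    for _ in range(n):
        obs = rng.uniform(-1.0, 1.0)
        total += base_objective(model.act(obs, candidates), 2.0 * obs)
    return total / n

# Base optimizer: crude hill climbing over theta. It never inspects the
# mesa-objective directly; it selects whatever parameters score well on
# the base objective during training.
theta, step = 0.0, 0.25
for _ in range(40):
    theta = min((theta - step, theta, theta + step), key=training_loss)

print(f"learned theta = {theta:.2f}  (base objective rewards theta = 2.0)")
```

In this sketch the mesa-objective that training selects happens to agree with the base objective on the training distribution, but nothing in the outer loop enforces that agreement off-distribution; that gap is exactly the alignment question the second part of the abstract raises.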
