Learning Intrinsic Rewards as a Bi-Level Optimization Problem

We reinterpret the problem of learning intrinsic rewards in reinforcement learning (RL) as a bi-level optimization problem. This interpretation lets us draw on recent advances from the hyperparameter optimization literature, primarily Self-Tuning Networks (STNs), to learn intrinsic rewards online. To support our method, we introduce a new general conditioning layer: Conditional Layer Normalization (CLN). We evaluate our approach on several continuous control benchmarks in the MuJoCo physics simulator; on all of these benchmarks, the intrinsic rewards learned on the fly lead to higher final rewards.
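The abstract does not spell out the form of CLN, so the following is a minimal sketch only: it assumes CLN works like FiLM-style conditioning applied to the affine parameters of layer normalization, with the scale and shift predicted from an external conditioning vector (e.g., the outer-level variables of the bi-level problem). The class name `ConditionalLayerNorm`, the shapes, and the conditioning input `cond` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose gain and bias are predicted from a
    conditioning vector instead of being fixed learned parameters.
    (Hypothetical sketch; names and shapes are assumptions.)"""

    def __init__(self, feature_dim: int, cond_dim: int):
        super().__init__()
        # Normalize without built-in affine parameters; the affine
        # transform is produced by the conditioning projections below.
        self.norm = nn.LayerNorm(feature_dim, elementwise_affine=False)
        # Predict a per-feature scale (gamma) and shift (beta) from the
        # conditioning input (e.g., sampled hyperparameters / reward weights).
        self.to_gamma = nn.Linear(cond_dim, feature_dim)
        self.to_beta = nn.Linear(cond_dim, feature_dim)
        # Initialize so the layer starts as plain LayerNorm (gamma=1, beta=0).
        nn.init.zeros_(self.to_gamma.weight)
        nn.init.ones_(self.to_gamma.bias)
        nn.init.zeros_(self.to_beta.weight)
        nn.init.zeros_(self.to_beta.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(cond)  # (batch, feature_dim)
        beta = self.to_beta(cond)    # (batch, feature_dim)
        return gamma * self.norm(x) + beta


# Usage sketch: modulate a policy network's hidden activations by the
# outer-level (intrinsic-reward / hyperparameter) vector.
if __name__ == "__main__":
    cln = ConditionalLayerNorm(feature_dim=64, cond_dim=8)
    h = torch.randn(32, 64)  # hidden activations
    z = torch.randn(32, 8)   # conditioning vector (assumed)
    print(cln(h, z).shape)   # torch.Size([32, 64])
```

The initialization choice (identity scale, zero shift) is a common way to make conditioned normalization layers behave like ordinary layer normalization at the start of training; whether the paper uses this scheme is an assumption.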
