Learning to Plan Optimistically: Uncertainty-Guided Deep Exploration via Latent Model Ensembles

Learning complex behaviors through interaction requires coordinated long-term planning. Random exploration and novelty search lack task-centric guidance and waste effort on non-informative interactions. Instead, decision-making should target samples with the potential to optimize performance far into the future, while only reducing uncertainty where conducive to this objective. This paper presents latent optimistic value exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine finite-horizon rollouts from a latent model with value function estimates to predict infinite-horizon returns and recover the associated uncertainty through ensembling. Policy training then proceeds on an upper confidence bound (UCB) objective to identify and select the interactions most promising for improving long-term performance. We apply LOVE to visual control tasks in continuous state-action spaces and demonstrate improved sample complexity on a selection of benchmark tasks.
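Below is a minimal sketch of the core idea described above: each ensemble member imagines a finite-horizon latent rollout, bootstraps the infinite-horizon tail with a value estimate, and the policy objective adds the ensemble's disagreement as an optimism bonus. The interface (`dynamics`, `reward`, `value` callables per member, the `beta` weight) is an illustrative assumption, not the authors' implementation.

```python
import torch

def ucb_objective(ensemble, z0, policy, horizon=10, gamma=0.99, beta=1.0):
    """Upper confidence bound over imagined returns from a latent-model ensemble.

    ensemble: list of dicts, each with callables 'dynamics', 'reward', 'value'
              (hypothetical per-member interface).
    z0:       initial latent state, shape (batch, latent_dim).
    policy:   maps a latent state to an action tensor.
    """
    returns = []
    for member in ensemble:
        z, ret, discount = z0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(z)
            ret = ret + discount * member["reward"](z, a)   # accumulate imagined reward
            z = member["dynamics"](z, a)                    # step the latent model
            discount *= gamma
        # Bootstrap the infinite-horizon tail with the member's value estimate.
        ret = ret + discount * member["value"](z)
        returns.append(ret)
    returns = torch.stack(returns)                # (num_members, batch)
    mean, std = returns.mean(0), returns.std(0)   # ensemble disagreement as uncertainty
    return (mean + beta * std).mean()             # optimism in the face of uncertainty
```

In this sketch the policy would be trained to maximize `ucb_objective`, so it is drawn toward interactions whose predicted long-term return is either high or highly uncertain.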
