Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a practical optimistic-exploration algorithm (H-UCRL), which enlarges the input space with hallucinated inputs that can exert as much control as the epistemic uncertainty in the model affords. We analyze this setting and derive a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian process models. Based on this theoretical foundation, we show how optimistic exploration can easily be combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.

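To make the hallucinated-input idea concrete, the following is a minimal sketch (not the paper's implementation). It assumes a calibrated dynamics model with a hypothetical `predict(state, action)` method returning a predictive mean and a per-dimension epistemic standard deviation, a confidence parameter `beta`, and an augmented policy that outputs both a real action and a hallucinated control `eta` in [-1, 1]^d; all names are illustrative.

```python
import numpy as np

def hallucinated_step(model, state, action, eta, beta=1.0):
    """One optimistic transition: the hallucinated input eta in [-1, 1]^d
    can steer the prediction anywhere inside the model's epistemic
    confidence interval mean(s, a) +/- beta * sigma(s, a)."""
    mean, epistemic_std = model.predict(state, action)  # calibrated model (assumed interface)
    eta = np.clip(eta, -1.0, 1.0)
    return mean + beta * epistemic_std * eta

def optimistic_rollout(model, policy, initial_state, horizon, beta=1.0):
    """Roll out a policy that returns both a real action and a hallucinated
    control, so planning is optimistic with respect to the unknown dynamics."""
    state, trajectory = initial_state, []
    for _ in range(horizon):
        action, eta = policy(state)  # augmented policy: real + hallucinated action
        state = hallucinated_step(model, state, action, eta, beta)
        trajectory.append((state, action))
    return trajectory
```

Under this construction, maximizing the predicted return over both the real policy and the hallucinated controls yields an optimistic value estimate that scales with the model's epistemic uncertainty, which is what drives directed exploration.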