Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, they ignore this distinction when optimizing the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a practical optimistic-exploration algorithm (H-UCRL), which enlarges the input space with hallucinated inputs that can exert as much control as the epistemic uncertainty in the model affords. We analyze this setting and derive a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian process models. Based on this theoretical foundation, we show how optimistic exploration can easily be combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.

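To make the hallucinated-input idea concrete, the following is a minimal sketch (not the paper's implementation). It assumes a calibrated dynamics model with a hypothetical `predict(state, action)` method returning a predictive mean and a per-dimension epistemic standard deviation, a confidence parameter `beta`, and an augmented policy that outputs both a real action and a hallucinated control `eta` in [-1, 1]^d; all names are illustrative.

```python
import numpy as np

def hallucinated_step(model, state, action, eta, beta=1.0):
    """One optimistic transition: the hallucinated input eta in [-1, 1]^d
    can steer the prediction anywhere inside the model's epistemic
    confidence interval mean(s, a) +/- beta * sigma(s, a)."""
    mean, epistemic_std = model.predict(state, action)  # calibrated model (assumed interface)
    eta = np.clip(eta, -1.0, 1.0)
    return mean + beta * epistemic_std * eta

def optimistic_rollout(model, policy, initial_state, horizon, beta=1.0):
    """Roll out a policy that returns both a real action and a hallucinated
    control, so planning is optimistic with respect to the unknown dynamics."""
    state, trajectory = initial_state, []
    for _ in range(horizon):
        action, eta = policy(state)  # augmented policy: real + hallucinated action
        state = hallucinated_step(model, state, action, eta, beta)
        trajectory.append((state, action))
    return trajectory
```

Under this construction, maximizing the predicted return over both the real policy and the hallucinated controls yields an optimistic value estimate that scales with the model's epistemic uncertainty, which is what drives directed exploration.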