Deep Reinforcement Learning with Dynamic Optimism

In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. This follows a series of breakthroughs addressing function approximation errors, which previously led to poor performance. These insights encourage the use of pessimistic value updates. However, pessimism discourages exploration and runs counter to the theoretical support for optimism in the face of uncertainty. So which approach is best? In this work, we show that the optimal degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic algorithm, Dynamic Optimistic and Pessimistic Estimation (DOPE), which switches between optimistic and pessimistic value learning online by formulating the selection as a multi-armed bandit problem. On a series of challenging continuous control tasks, we show that DOPE outperforms existing state-of-the-art methods, which rely on a fixed degree of optimism. Since our changes are simple to implement, we believe these insights can be extended to a number of off-policy algorithms.
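To make the bandit-driven switch concrete, here is a minimal sketch of how such a mechanism could look in a twin-critic (TD3/SAC-style) setup. It is an illustration under stated assumptions, not the paper's exact formulation: the `OptimismBandit` class, the discrete set of candidate optimism levels `betas`, the EXP3-style weight update, and the use of a scalar performance-improvement signal as bandit feedback are all assumptions introduced for exposition.

```python
import numpy as np


class OptimismBandit:
    """Illustrative multi-armed bandit (EXP3-style) over a discrete set of
    optimism levels. A sketch of the general idea, not DOPE itself."""

    def __init__(self, betas=(-0.5, 0.0, 0.5), lr=0.1):
        self.betas = np.asarray(betas)           # candidate optimism coefficients
        self.log_weights = np.zeros(len(betas))  # one weight per bandit arm
        self.lr = lr
        self.arm = 0
        self.probs = np.full(len(betas), 1.0 / len(betas))

    def sample(self):
        # Sample an optimism level with probability proportional to arm weights.
        probs = np.exp(self.log_weights - self.log_weights.max())
        probs /= probs.sum()
        self.probs = probs
        self.arm = np.random.choice(len(self.betas), p=probs)
        return self.betas[self.arm]

    def update(self, feedback):
        # Importance-weighted update with a scalar feedback signal, e.g. the
        # change in episodic return (an assumption made for this sketch).
        self.log_weights[self.arm] += self.lr * feedback / self.probs[self.arm]


def value_target(q1, q2, beta):
    """Blend twin critic estimates: beta = -0.5 recovers the pessimistic
    clipped-double-Q target min(q1, q2), while beta = +0.5 gives the
    optimistic max(q1, q2)."""
    mean = 0.5 * (q1 + q2)
    spread = np.abs(q1 - q2)
    return mean + beta * spread
```

In a full agent, `beta` would be resampled periodically (for example once per episode), the critic targets would be built with `value_target`, and the bandit would be updated with the observed change in performance, so that the degree of optimism adapts to the task and to the stage of learning.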
