Tactical Optimism and Pessimism for Deep Reinforcement Learning
[1] Csaba Szepesvári, et al. Tuning Bandit Algorithms in Stochastic Environments, 2007, ALT.
[2] Marc G. Bellemare, et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.
[3] Carlos Riquelme, et al. Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates, 2019, NeurIPS.
[4] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.
[5] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[6] Krzysztof Choromanski, et al. Effective Diversity in Population-Based Reinforcement Learning, 2020, NeurIPS.
[7] Pieter Abbeel, et al. CURL: Contrastive Unsupervised Representations for Reinforcement Learning, 2020, ICML.
[8] Frederick R. Forst, et al. On robust estimation of the location parameter, 1980.
[9] Sarah Filippi, et al. Optimism in reinforcement learning and Kullback-Leibler divergence, 2010, 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).
[10] Alessandro Lazaric, et al. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning, 2018, ICML.
[11] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.
[12] Ruben Villegas, et al. Learning Latent Dynamics for Planning from Pixels, 2018, ICML.
[13] Krzysztof Choromanski, et al. On Optimism in Model-Based Reinforcement Learning, 2020, arXiv.
[14] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.
[15] Arthur Gretton, et al. Kernelized Wasserstein Natural Gradient, 2020, ICLR.
[16] Long Ji Lin, et al. Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.
[17] Arthur Gretton, et al. Efficient Wasserstein Natural Gradients for Reinforcement Learning, 2020, ICLR.
[18] Julian Zimmert, et al. Model Selection in Contextual Stochastic Bandit Problems, 2020, NeurIPS.
[19] Bo Liu, et al. QUOTA: The Quantile Option Architecture for Reinforcement Learning, 2018, AAAI.
[20] Mohammad Norouzi, et al. Dream to Control: Learning Behaviors by Latent Imagination, 2019, ICLR.
[21] Robert Loftin, et al. Better Exploration with Optimistic Actor-Critic, 2019, NeurIPS.
[22] Christos Dimitrakakis, et al. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities, 2019, arXiv.
[23] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.
[24] Honglak Lee, et al. Predictive Information Accelerates Learning in RL, 2020, NeurIPS.
[25] Gábor Lugosi, et al. Prediction, learning, and games, 2006.
[26] Herke van Hoof, et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.
[27] Sergey Levine, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, 2018, ICML.
[28] Ambuj Tewari, et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs, 2009, UAI.
[29] Hado van Hasselt, et al. Double Q-learning, 2010, NIPS.
[30] Quoc V. Le, et al. Evolving Reinforcement Learning Algorithms, 2021, arXiv.
[31] Daniel Guo, et al. Agent57: Outperforming the Atari Human Benchmark, 2020, ICML.
[32] Philip J. Ball, et al. OffCon3: What is state of the art anyway?, 2021, arXiv.
[33] Sebastian Thrun, et al. Issues in Using Function Approximation for Reinforcement Learning, 1999.
[34] Wojciech Zaremba, et al. OpenAI Gym, 2016, arXiv.
[35] Krzysztof Choromanski, et al. Ready Policy One: World Building Through Active Learning, 2020, ICML.
[36] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[37] Michael I. Jordan, et al. Learning to Score Behaviors for Guided Policy Optimization, 2020, ICML.
[38] Haipeng Luo, et al. Corralling a Band of Bandit Algorithms, 2016, COLT.
[39] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.
[40] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[41] Ilya Kostrikov, et al. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels, 2020, arXiv.
[42] Junhyuk Oh, et al. Discovering Reinforcement Learning Algorithms, 2020, NeurIPS.
[43] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.
[44] Benjamin Van Roy, et al. Deep Exploration via Bootstrapped DQN, 2016, NIPS.
[45] Michael I. Jordan, et al. Provably Efficient Reinforcement Learning with Linear Function Approximation, 2019, COLT.
[46] Mengdi Wang, et al. Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound, 2019, ICML.
[47] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents, 2012, J. Artif. Intell. Res.
[48] Tom Schaul, et al. Adapting Behaviour for Learning Progress, 2019, arXiv.
[49] Jia Yuan Yu, et al. A Scheme for Dynamic Risk-Sensitive Sequential Decision Making, 2019, arXiv.
[50] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.
[51] Pieter Abbeel, et al. Reinforcement Learning with Augmented Data, 2020, NeurIPS.
[52] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.
[53] Louis Kirsch, et al. Improving Generalization in Meta Reinforcement Learning using Learned Objectives, 2020, ICLR.
[54] Marc G. Bellemare, et al. Statistics and Samples in Distributional Reinforcement Learning, 2019, ICML.
[55] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.
[56] Claudio Gentile, et al. Regret Bound Balancing and Elimination for Model Selection in Bandits and RL, 2020, arXiv.
[57] Rémi Munos, et al. Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.