Deep Reinforcement Learning with Dynamic Optimism

In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. This follows a series of breakthroughs addressing function approximation errors, which previously led to poor performance. These insights encourage the use of pessimistic value updates. However, pessimism discourages exploration and runs counter to the theoretical support for optimism in the face of uncertainty. So which approach is best? In this work, we show that the optimal degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic algorithm, Dynamic Optimistic and Pessimistic Estimation (DOPE), which switches between optimistic and pessimistic value learning online by formulating the selection as a multi-armed bandit problem. On a series of challenging continuous control tasks, we show that DOPE outperforms existing state-of-the-art methods, which rely on a fixed degree of optimism. Since our changes are simple to implement, we believe these insights can be extended to a number of off-policy algorithms.
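To make the bandit-driven switch concrete, here is a minimal sketch of how such a mechanism could look in a twin-critic (TD3/SAC-style) setup. It is an illustration under stated assumptions, not the paper's exact formulation: the `OptimismBandit` class, the discrete set of candidate optimism levels `betas`, the EXP3-style weight update, and the use of a scalar performance-improvement signal as bandit feedback are all assumptions introduced for exposition.

```python
import numpy as np


class OptimismBandit:
    """Illustrative multi-armed bandit (EXP3-style) over a discrete set of
    optimism levels. A sketch of the general idea, not DOPE itself."""

    def __init__(self, betas=(-0.5, 0.0, 0.5), lr=0.1):
        self.betas = np.asarray(betas)           # candidate optimism coefficients
        self.log_weights = np.zeros(len(betas))  # one weight per bandit arm
        self.lr = lr
        self.arm = 0
        self.probs = np.full(len(betas), 1.0 / len(betas))

    def sample(self):
        # Sample an optimism level with probability proportional to arm weights.
        probs = np.exp(self.log_weights - self.log_weights.max())
        probs /= probs.sum()
        self.probs = probs
        self.arm = np.random.choice(len(self.betas), p=probs)
        return self.betas[self.arm]

    def update(self, feedback):
        # Importance-weighted update with a scalar feedback signal, e.g. the
        # change in episodic return (an assumption made for this sketch).
        self.log_weights[self.arm] += self.lr * feedback / self.probs[self.arm]


def value_target(q1, q2, beta):
    """Blend twin critic estimates: beta = -0.5 recovers the pessimistic
    clipped-double-Q target min(q1, q2), while beta = +0.5 gives the
    optimistic max(q1, q2)."""
    mean = 0.5 * (q1 + q2)
    spread = np.abs(q1 - q2)
    return mean + beta * spread
```

In a full agent, `beta` would be resampled periodically (for example once per episode), the critic targets would be built with `value_target`, and the bandit would be updated with the observed change in performance, so that the degree of optimism adapts to the task and to the stage of learning.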
