Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks, yet they perform well in many others. In practice, they are often the method of choice because of their simplicity. But on which tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? Despite the practical prominence of these policies, such questions have received little theoretical attention. This paper presents a theoretical analysis of myopic exploration and provides the first regret and sample-complexity bounds for reinforcement learning with such policies. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure, the myopic exploration gap, denoted by alpha, that captures a structural property of the MDP, the exploration policy, and the given value-function class. We show that the sample complexity of myopic exploration scales as 1/alpha^2, that is, quadratically with the inverse of this quantity. Through concrete examples, we further demonstrate that the myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, owing to the corresponding dynamics and reward structure.
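For reference, below is a minimal sketch of the generic epsilon-greedy rule that this analysis covers: with probability epsilon the agent explores uniformly at random, and otherwise it acts greedily with respect to its current value estimates. This is the standard textbook rule, not the paper's specific algorithm, and all names (epsilon_greedy_action, q_values) are illustrative.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action epsilon-greedily with respect to q_values.

    With probability epsilon, explore by sampling an action uniformly
    at random; otherwise exploit by taking the greedy action.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # uniform random exploration
    return int(np.argmax(q_values))              # greedy exploitation

# Usage: select an action from estimated Q-values for the current state.
rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.3])
action = epsilon_greedy_action(q_values, epsilon=0.1, rng=rng)
```

In the value-function-based setting studied here, q_values would come from the learner's current estimate in the given function class; the paper's bounds then say, roughly, that the number of episodes needed scales with 1/alpha^2 for the myopic exploration gap alpha of the task.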
