Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs

The exploration bonus is an effective approach to managing the exploration-exploitation trade-off in Markov Decision Processes (MDPs). While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analyzing the exploration bonus in the more challenging infinite-horizon undiscounted setting. We first introduce SCAL+, a variant of SCAL (Fruit et al., 2018) that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound $c$ on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm for this setting.
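
To make the general scheme concrete, the following is a minimal, self-contained Python sketch of the idea the abstract describes: inflate the empirical rewards with a count-based bonus that shrinks as $1/\sqrt{N(s,a)}$, then plan on the optimistic model while truncating the value function so that its span stays below the known bound $c$. The bonus form, the truncation step, and all names (count_based_bonus, span_constrained_vi) are illustrative assumptions made here, not the exact SCAL+ construction.

```python
import numpy as np

def count_based_bonus(counts, span_bound, r_max, t, delta=0.05):
    """Illustrative exploration bonus (NOT the exact SCAL+ bonus):
    shrinks as O(1/sqrt(N(s,a))) and is scaled by r_max + c so that it
    can dominate both reward and transition uncertainty."""
    return (r_max + span_bound) * np.sqrt(
        np.log(max(t, 2) / delta) / np.maximum(counts, 1.0)
    )

def span_constrained_vi(r_hat, p_hat, bonus, span_bound, n_iter=500):
    """Value iteration on the empirical MDP with bonus-inflated rewards,
    truncating the value function so its span never exceeds c (a
    simplified rendering of the SCAL/SCAL+ planning idea)."""
    n_states, _ = r_hat.shape
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = r_hat + bonus + p_hat @ v                         # optimistic Q-values, shape (S, A)
        v_new = q.max(axis=1)
        v_new = np.minimum(v_new, v_new.min() + span_bound)   # span truncation
        v = v_new - v_new.min()                               # recenter (relative value iteration)
    return q.argmax(axis=1)                                   # greedy optimistic policy

# Tiny usage example on a random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
S, A = 3, 2
p_hat = rng.dirichlet(np.ones(S), size=(S, A))   # empirical transition model, shape (S, A, S)
r_hat = rng.uniform(size=(S, A))                 # empirical mean rewards in [0, 1]
counts = rng.integers(1, 50, size=(S, A))        # visit counts N(s, a)
bonus = count_based_bonus(counts, span_bound=2.0, r_max=1.0, t=counts.sum())
policy = span_constrained_vi(r_hat, p_hat, bonus, span_bound=2.0)
print(policy)
```

The point of the sketch is the structural difference from extended value iteration: uncertainty enters only through an additive reward bonus, so planning reduces to ordinary (span-truncated) value iteration on a single empirical model rather than optimization over a set of plausible models.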

[1] Alessandro Lazaric, et al. Regret Minimization in MDPs with Options without Prior Knowledge, 2017, NIPS.

[2] Ronald Ortner, et al. Regret Bounds for Reinforcement Learning via Markov Chain Concentration, 2018, J. Artif. Intell. Res.

[3] Yishay Mansour, et al. Convergence of Optimistic and Incremental Q-Learning, 2001, NIPS.

[4] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[5] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[6] Ronald Ortner, et al. Optimism in the Face of Uncertainty Should be Refutable, 2008, Minds and Machines.

[7] Andrew W. Moore, et al. Efficient memory-based learning for robot control, 1990.

[8] Marc G. Bellemare, et al. Count-Based Exploration with Neural Density Models, 2017, ICML.

[9] K. I. M. McKinnon, et al. On the Generation of Markov Decision Processes, 1995.

[10] Achim Klenke, et al. Probability theory - a comprehensive course, 2008, Universitext.

[11] Ronald Ortner, et al. Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning, 2015, ICML.

[12] Shipra Agrawal, et al. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, 2017, NIPS.

[13] Tom Schaul, et al. Unifying Count-Based Exploration and Intrinsic Motivation, 2016, NIPS.

[14] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[15] Ronald Ortner, et al. Online Regret Bounds for Undiscounted Continuous Reinforcement Learning, 2012, NIPS.

[16] Claudio Gentile, et al. Improved Risk Tail Bounds for On-Line Algorithms, 2005, IEEE Transactions on Information Theory.

[17] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[18] D. Freedman. On Tail Probabilities for Martingales, 1975.

[19] Marcus Hutter, et al. Count-Based Exploration in Feature Space for Reinforcement Learning, 2017, IJCAI.

[20] Michael T. Rosenstein, et al. Supervised Actor-Critic Reinforcement Learning, 2012.

[21] Filip De Turck, et al. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, 2016, NIPS.

[22] H. Teicher, et al. Probability theory: Independence, interchangeability, martingales, 1978.

[23] Ambuj Tewari, et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs, 2009, UAI.

[24] Sébastien Bubeck, et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.

[25] Alessandro Lazaric, et al. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning, 2018, ICML.

[26] Sham M. Kakade, et al. Variance Reduction Methods for Sublinear Reinforcement Learning, 2018, ArXiv.

[27] Alessandro Lazaric, et al. Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes, 2018, NeurIPS.

[28] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.

[29] Mohammad Sadegh Talebi, et al. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs, 2018, ALT.

[30] Vivek S. Borkar, et al. Learning Algorithms for Markov Decision Processes with Average Cost, 2001, SIAM J. Control. Optim.

[31] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.