Adaptive Reward-Free Exploration

Reward-free exploration is a reinforcement learning setting recently studied by Jin et al. [7], who address it by running several regret-minimizing algorithms in parallel. In this work, we instead propose a more adaptive approach to reward-free exploration that directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL) can be seen as a variant of an algorithm proposed by Fiechter in 1994 [18], originally designed for a different objective that we call best-policy identification. We prove that RF-UCRL needs O((SAH⁴/ε²)·log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies that use a generative model.
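
To make the idea of exploring by reducing error upper bounds concrete, here is a minimal Python sketch of a reward-free UCRL-style loop. It assumes a small tabular MDP with S states, A actions, and horizon H; the bonus function `beta`, the (H − h)·√(2β/n) scaling, and the ε/2 stopping threshold are illustrative assumptions rather than the exact quantities analysed in the paper.

```python
import numpy as np

def beta(n, delta, S):
    """Hypothetical log-confidence term; the paper's exact threshold differs."""
    return np.log(2.0 * S / delta) + np.log(np.e * max(int(n), 1))

def error_upper_bounds(counts, p_hat, H, delta):
    """Dynamic-programming upper bounds E[h, s, a] on the estimation error.

    counts[s, a]     -- visit counts for each state-action pair
    p_hat[s, a, s']  -- empirical transition probabilities
    """
    S, A = counts.shape
    E = np.zeros((H + 1, S, A))          # E[H] = 0 by convention
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                if counts[s, a] == 0:
                    E[h, s, a] = H - h   # never visited: maximal remaining error
                else:
                    bonus = (H - h) * np.sqrt(2.0 * beta(counts[s, a], delta, S) / counts[s, a])
                    future = p_hat[s, a] @ E[h + 1].max(axis=1)
                    E[h, s, a] = min(H - h, bonus + future)
    return E[:H]

def run_rf_ucrl(P_true, s0, H, delta, eps, max_episodes=10_000, seed=0):
    """Explore greedily w.r.t. the error bounds until they fall below eps / 2."""
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    counts = np.zeros((S, A), dtype=int)
    trans = np.zeros((S, A, S), dtype=int)
    for _ in range(max_episodes):
        p_hat = trans / np.maximum(counts[..., None], 1)
        E = error_upper_bounds(counts, p_hat, H, delta)
        if E[0, s0].max() <= eps / 2:    # stopping rule: initial-state error small enough
            break
        s = s0
        for h in range(H):               # act greedily w.r.t. the error bounds
            a = int(np.argmax(E[h, s]))
            s_next = rng.choice(S, p=P_true[s, a])
            counts[s, a] += 1
            trans[s, a, s_next] += 1
            s = s_next
    return counts, trans
```

Under these assumptions, the empirical model accumulated at stopping time can be handed to a standard planner for any reward function; the guarantee stated above is that the resulting policy is ε-optimal with probability at least 1 − δ.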

[1] E. Kaufmann, et al. Planning in Markov Decision Processes with Gap-Dependent Sample Complexity, 2020, NeurIPS.

[2] Xuezhou Zhang, et al. Task-agnostic Exploration in Reinforcement Learning, 2020, NeurIPS.

[3] Hilbert J. Kappen, et al. On the Sample Complexity of Reinforcement Learning with a Generative Model, 2012, ICML.

[4] Lihong Li, et al. PAC model-free reinforcement learning, 2006, ICML.

[5] Shie Mannor, et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, 2006, J. Mach. Learn. Res.

[6] Haim Kaplan, et al. Near-optimal Regret Bounds for Stochastic Shortest Path, 2020, ICML.

[7] Akshay Krishnamurthy, et al. Reward-Free Exploration for Reinforcement Learning, 2020, ICML.

[8] Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.

[9] Marc G. Bellemare, et al. Count-Based Exploration with Neural Density Models, 2017, ICML.

[10] T. Lai, et al. Self-Normalized Processes: Exponential Inequalities, Moment Bounds and Iterated Logarithm Laws, 2004, math/0410102.

[11] Ronald Ortner, et al. Autonomous exploration for navigating in non-stationary CMPs, 2019, ArXiv.

[12] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.

[13] Michael Kearns, et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms, 1998, NIPS.

[14] Jürgen Schmidhuber, et al. A possibility for implementing curiosity and boredom in model-building neural controllers, 1991.

[15] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[16] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[17] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[18] Claude-Nicolas Fiechter, et al. Efficient reinforcement learning, 1994, COLT '94.

[19] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.

[20] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[21] Nuttapong Chentanez, et al. Intrinsically Motivated Reinforcement Learning, 2004, NIPS.

[22] Keyan Zahedi, et al. Information Theoretically Aided Reinforcement Learning for Embodied Agents, 2016, ArXiv.

[23] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.

[24] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.

[25] Shakir Mohamed, et al. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning, 2015, NIPS.

[26] Sarah Filippi, et al. Optimism in reinforcement learning and Kullback-Leibler divergence, 2010, 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[27] Doina Precup, et al. An information-theoretic approach to curiosity-driven reinforcement learning, 2012, Theory in Biosciences.

[28] Ruosong Wang, et al. On Reward-Free Reinforcement Learning with Linear Function Approximation, 2020, NeurIPS.

[29] Filip De Turck, et al. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, 2016, NIPS.

[30] Alessandro Lazaric, et al. No-Regret Exploration in Goal-Oriented Reinforcement Learning, 2020, ICML.

[31] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[32] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[33] Sham M. Kakade, et al. Provably Efficient Maximum Entropy Exploration, 2018, ICML.

[34] Peter Auer, et al. Autonomous Exploration For Navigating In MDPs, 2012, COLT.