Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying ε-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an ε-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show that this is not possible: there exists a fundamental tradeoff between achieving low regret and identifying an ε-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity, yielding a complexity that scales with the suboptimality gaps and the “reachability” of a state. We show that our algorithm is nearly minimax optimal, and demonstrate on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
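
For context, the “simple reduction” mentioned above is the standard online-to-batch conversion. The following is a minimal sketch under generic notation; the symbols $V^{\star}$, $V^{\pi_k}$, the constant $C$, and the $\sqrt{K}$ regret rate are placeholders for illustration, not the paper's own definitions:
\[
\mathcal{R}_K \;=\; \sum_{k=1}^{K}\bigl(V^{\star} - V^{\pi_k}\bigr) \;\le\; C\sqrt{K}
\quad\Longrightarrow\quad
\mathbb{E}\bigl[V^{\star} - V^{\widehat{\pi}}\bigr] \;=\; \frac{\mathcal{R}_K}{K} \;\le\; \frac{C}{\sqrt{K}},
\]
where $\widehat{\pi}$ is drawn uniformly at random from the executed policies $\{\pi_1,\dots,\pi_K\}$. Hence roughly $K \approx (C/\epsilon)^2$ episodes suffice to return an ε-optimal policy in expectation. This matches the worst-case rate but is insensitive to instance structure such as suboptimality gaps, which is precisely the gap the instance-dependent complexity measure proposed here is meant to address.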
