Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment-design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
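The abstract describes the experiment-design idea only at a high level. As a rough illustration of the kind of subroutine alluded to -- allocating exploration so as to minimize worst-case estimation variance along the feature "directions" relevant for comparing policies -- the sketch below runs Frank-Wolfe over a design distribution. This is a minimal sketch of a generic XY-style optimal design, not \textsc{Pedel} itself; the names `frank_wolfe_design`, `Phi`, and `Z` are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def frank_wolfe_design(Phi, Z, n_iters=200):
    """Sketch of an XY-style experiment design.

    Chooses a distribution lam over available exploration features (rows of Phi)
    that approximately minimizes max_{z in Z} z^T A(lam)^{-1} z, where
    A(lam) = sum_i lam_i * Phi_i Phi_i^T, using Frank-Wolfe updates.
    """
    n, d = Phi.shape
    lam = np.full(n, 1.0 / n)                       # start from the uniform design
    for t in range(n_iters):
        A = Phi.T @ (lam[:, None] * Phi) + 1e-8 * np.eye(d)
        A_inv = np.linalg.inv(A)
        # direction in Z that is currently hardest to estimate
        vals = np.einsum('ij,jk,ik->i', Z, A_inv, Z)
        z_star = Z[np.argmax(vals)]
        # gradient of z*^T A(lam)^{-1} z* w.r.t. lam_i is -(phi_i^T A^{-1} z*)^2,
        # so the Frank-Wolfe step moves mass onto the feature that most reduces it
        scores = (Phi @ (A_inv @ z_star)) ** 2
        i_star = np.argmax(scores)
        step = 2.0 / (t + 2)
        lam = (1 - step) * lam
        lam[i_star] += step
    return lam

# Toy usage: d = 3 features, a handful of directions along which
# value differences between candidate policies would be estimated.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))   # candidate exploration feature vectors
Z = rng.normal(size=(5, 3))      # directions relevant to policy comparison
lam = frank_wolfe_design(Phi, Z)
print(np.round(lam, 3))
```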
