TOWARDS MINIMAX OPTIMAL REWARD-FREE REINFORCEMENT LEARNING IN LINEAR MDPS

We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment in an exploration phase without access to the reward function. In the subsequent planning phase, it is given a reward function and asked to output an ε-optimal policy. We propose a novel algorithm, LSVI-RFE, for the linear MDP setting, where the transition probabilities and reward functions are linear in a feature mapping. We prove an Õ(H^4 d^2/ε^2) sample complexity upper bound for LSVI-RFE, where H is the episode length and d is the feature dimension. We also establish a sample complexity lower bound of Ω(H^3 d^2/ε^2). To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm that achieves the minimax optimal sample complexity for linear MDPs, up to a factor of H and logarithmic factors. LSVI-RFE is built on a novel variance-aware exploration mechanism that avoids the overly conservative exploration of prior works. Our sharp bound relies on decoupling the UCB bonuses of the two phases and on a Bernstein-type self-normalized bound, which remove the extra dependence of the sample complexity on H and on d, respectively.
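To make the variance-aware exploration idea concrete, the following Python snippet is a minimal sketch, not the paper's LSVI-RFE pseudocode, of the building block such algorithms share: a variance-weighted (ridge) least-squares value estimate combined with an elliptical UCB bonus. All names and values here (the feature samples, sigma2_hat, beta, the toy dimensions) are illustrative assumptions rather than quantities defined in the paper.

```python
# Minimal sketch of variance-aware exploration in a linear MDP (illustrative only):
# weight each regression sample by an estimated variance and add an elliptical
# UCB-style bonus on top of the resulting value estimate.
import numpy as np

d = 4            # feature dimension (toy value)
lam = 1.0        # ridge regularization
beta = 1.0       # bonus multiplier; in the analysis this would come from a
                 # Bernstein-type self-normalized concentration bound

rng = np.random.default_rng(0)

# Collected transitions: features phi(s_h, a_h), regression targets y
# (e.g. next-step value estimates), and estimated conditional variances.
Phi = rng.normal(size=(50, d))            # 50 toy samples of phi(s, a)
y = rng.normal(size=50)                   # regression targets
sigma2_hat = 0.5 + rng.random(50)         # estimated variances (lower-bounded)

# Variance-weighted ridge regression: each sample is scaled by 1 / sigma_hat,
# so low-variance transitions contribute more than a uniform weighting would.
W = Phi / np.sqrt(sigma2_hat)[:, None]
z = y / np.sqrt(sigma2_hat)
Lambda = W.T @ W + lam * np.eye(d)        # weighted Gram matrix
w_hat = np.linalg.solve(Lambda, W.T @ z)  # weighted least-squares solution

def ucb_bonus(phi: np.ndarray) -> float:
    """Elliptical exploration bonus beta * ||phi||_{Lambda^{-1}}."""
    return beta * float(np.sqrt(phi @ np.linalg.solve(Lambda, phi)))

phi_query = rng.normal(size=d)
optimistic_value = float(phi_query @ w_hat) + ucb_bonus(phi_query)
print(f"estimate={phi_query @ w_hat:.3f}, bonus={ucb_bonus(phi_query):.3f}, "
      f"optimistic={optimistic_value:.3f}")
```

The design point this sketch illustrates is that reweighting by estimated variances shrinks the Gram matrix in directions that are already well estimated, so the resulting bonus is tighter than a uniform Hoeffding-style bonus; this is the sense in which variance-aware exploration avoids being overly conservative.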
