Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm, UCB-Advantage, and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret, where $T = KH$ and $K$ is the number of episodes played. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information-theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves a low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
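For readers who want a concrete picture of the model-free, optimism-based framework that UCB-Advantage belongs to, the sketch below implements plain tabular Q-learning with a Hoeffding-style UCB bonus in the spirit of [Jin et al., 2018], the baseline the paper improves upon. It is illustrative only: the reference-advantage decomposition and the variance-aware bonuses that yield the $\tilde{O}(\sqrt{H^2SAT})$ bound are not shown, and the helper names, constants, and toy MDP are hypothetical.

```python
import numpy as np

# Illustrative sketch, NOT the paper's UCB-Advantage algorithm: tabular
# optimistic Q-learning with a Hoeffding-style exploration bonus, in the
# spirit of [Jin et al., 2018]. Constants (c, delta), helper names, and
# the toy MDP below are assumptions made for the example.

def optimistic_q_learning(S, A, H, K, sample_step, sample_reward,
                          c=1.0, delta=0.01):
    """Run K episodes; sample_step(h, s, a) -> next state,
    sample_reward(h, s, a) -> reward in [0, 1]."""
    T = K * H
    iota = np.log(S * A * T / delta)              # log factor inside the bonus
    Q = np.full((H, S, A), float(H))              # optimistic initialization
    V = np.zeros((H + 1, S))                      # V[H] = 0 (terminal values)
    N = np.zeros((H, S, A), dtype=int)            # visit counts
    total_reward = 0.0

    for _ in range(K):
        s = 0                                     # fixed initial state (assumption)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))           # act greedily w.r.t. optimistic Q
            r = sample_reward(h, s, a)
            s_next = sample_step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)             # learning rate alpha_t = (H+1)/(H+t)
            bonus = c * np.sqrt(H**3 * iota / t)  # Hoeffding-style UCB bonus
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(float(H), Q[h, s].max())
            total_reward += r
            s = s_next
    return Q, total_reward


# Toy usage on a small random MDP (hypothetical):
rng = np.random.default_rng(0)
H_, S_, A_ = 4, 3, 2
P = rng.dirichlet(np.ones(S_), size=(H_, S_, A_))   # transition kernels
R = rng.random((H_, S_, A_))                         # deterministic rewards in [0, 1]
Q, G = optimistic_q_learning(
    S=S_, A=A_, H=H_, K=500,
    sample_step=lambda h, s, a: int(rng.choice(S_, p=P[h, s, a])),
    sample_reward=lambda h, s, a: float(R[h, s, a]),
)
```

The $\sqrt{H^3}$ factor in this crude bonus is what drives the extra $H$ dependence in [Jin et al., 2018]; UCB-Advantage sharpens it by maintaining a reference value function and bounding the variance of the remaining advantage term.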

[1] D. Freedman. On Tail Probabilities for Martingales, 1975, Ann. Probab.

[2] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[3] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[4] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.

[5] Shipra Agrawal, et al. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, 2017, NIPS.

[6] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[7] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[8] Sham M. Kakade, et al. Variance Reduction Methods for Sublinear Reinforcement Learning, 2018, arXiv.

[9] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[10] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[11] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[12] Xiangyang Ji, et al. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function, 2019, NeurIPS.

[13] Sergey Levine, et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.

[14] Yu Bai, et al. Provably Efficient Q-Learning with Low Switching Cost, 2019, NeurIPS.

[15] Lihong Li, et al. PAC model-free reinforcement learning, 2006, ICML.

[16] Xian Wu, et al. Variance reduced value iteration and faster algorithms for solving Markov decision processes, 2017, SODA.

[17] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[18] Apostolos Burnetas, et al. Optimal Adaptive Policies for Markov Decision Processes, 1997, Math. Oper. Res.

[19] Xiaoyu Chen, et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP, 2019, ICLR.

[20] Xian Wu, et al. Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model, 2018, NeurIPS.

[21] Yi Ouyang, et al. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach, 2017, NIPS.

[22] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[23] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[24] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[25] Max Simchowitz, et al. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, 2019, NeurIPS.

[26] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, arXiv.