论文信息 - Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{\epsilon^4(1-\gamma)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$ except for logarithmic factors.

[1] Tor Lattimore,et al. PAC Bounds for Discounted MDPs , 2012, ALT.

[2] Michael L. Littman,et al. An analysis of model-based Interval Estimation for Markov Decision Processes , 2008, J. Comput. Syst. Sci..

[3] Sham M. Kakade,et al. On the sample complexity of reinforcement learning. , 2003 .

[4] Lihong Li,et al. PAC model-free reinforcement learning , 2006, ICML.

[5] Michael I. Jordan,et al. Is Q-learning Provably Efficient? , 2018, NeurIPS.

[6] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[7] Csaba Szepesvári,et al. Model-based reinforcement learning with nearly tight exploration complexity bounds , 2010, ICML.

[8] Yishay Mansour,et al. Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..

[9] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[10] Tor Lattimore,et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[11] Xian Wu,et al. Variance reduced value iteration and faster algorithms for solving Markov decision processes , 2017, SODA.

[12] Xian Wu,et al. Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model , 2018, NeurIPS.

[13] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[14] Rémi Munos,et al. Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[15] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.

[16] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[17] Hilbert J. Kappen,et al. Speedy Q-Learning , 2011, NIPS.