Variance Reduction Methods for Sublinear Reinforcement Learning

This work considers the problem of provably optimal reinforcement learning for episodic finite-horizon MDPs, i.e. how an agent learns to maximize its long-term reward in an uncertain environment. The main contribution is a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ), which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episodic horizon. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in the sense that the number of samples needed to achieve $\epsilon$-average regret (for any constant $\epsilon$) is $O(SA)$, far fewer than the $O(S^2A)$ parameters that specify the transition model, and hence far fewer than are needed to learn any non-trivial estimate of that model. Sub-linear sample complexity is a central motivation for $Q$-learning and other "model-free" approaches. The vUCQ algorithm also enjoys asymptotically minimax-optimal regret, matching the $\Omega(\sqrt{HSAT})$ lower bound. vUCQ is a successive-refinement method that reduces the variance of the $Q$-value estimates and couples this estimation scheme with an upper-confidence-based exploration strategy. Technically, it is the coupling of these two techniques that yields both the sub-linear regret property and the asymptotically optimal regret.
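To make the upper-confidence component concrete, the following is a minimal sketch of tabular, optimistic $Q$-learning for an episodic finite-horizon MDP with a Hoeffding-style bonus. It is not the paper's vUCQ algorithm (in particular, it omits the variance-reduction step); the environment interface (`reset`/`step`), the bonus constant `c`, and the function name `ucb_q_learning` are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's vUCQ): tabular Q-learning with a
# Hoeffding-style upper-confidence bonus for an episodic finite-horizon MDP.
# The environment interface (reset/step), the bonus constant c, and all names
# are assumptions made for this example.
import numpy as np

def ucb_q_learning(env, S, A, H, num_episodes, c=1.0):
    """Optimistic tabular Q-learning; returns a Q table of shape (H, S, A)."""
    Q = np.full((H, S, A), float(H))      # optimistic initialization at the max return H
    N = np.zeros((H, S, A), dtype=int)    # visit counts per (step, state, action)
    for _ in range(num_episodes):
        s = env.reset()                   # assumed to return an integer state in [0, S)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))   # act greedily w.r.t. the optimistic Q-values
            s_next, r = env.step(a)       # assumed to return (next_state, reward)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                                 # step size weighting recent samples
            bonus = c * np.sqrt(H**3 * np.log(num_episodes * H) / t)  # exploration bonus shrinking as 1/sqrt(t)
            v_next = 0.0 if h == H - 1 else min(float(H), np.max(Q[h + 1, s_next]))
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s_next
    return Q
```

Read against the abstract, the paper's contribution can be viewed as tightening such confidence bonuses by coupling them with variance-reduced estimates of the $Q$-values, which is what produces the stated $\widetilde{O}(\sqrt{HSAT} + H^5SA)$ regret bound.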
