AsyncQVI: Asynchronous-Parallel Q-Value Iteration for Reinforcement Learning with Near-Optimal Sample Complexity

In this paper, we propose AsyncQVI, an asynchronous-parallel Q-value iteration algorithm for reinforcement learning. Given a problem with $|\mathcal{S}|$ states, $|\mathcal{A}|$ actions, and a discount factor $\gamma\in(0,1)$, AsyncQVI uses $\mathcal{O}(|\mathcal{S}|)$ memory and returns an $\varepsilon$-optimal policy with probability at least $1-\delta$ using $$\tilde{\mathcal{O}}\bigg(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^2}\log\Big(\frac{1}{\delta}\Big)\bigg)$$ samples, which nearly matches the theoretical lower bound. AsyncQVI is also the first asynchronous-parallel reinforcement learning algorithm with both a convergence rate and a sample complexity guarantee. Its low memory footprint and parallelism make it suitable for large-scale applications. In numerical tests, we compare AsyncQVI with four sample-based value iteration methods; the results show that AsyncQVI is highly efficient and achieves linear parallel speedup.
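To make the sample-based, asynchronous update pattern concrete, below is a minimal Python sketch of the kind of computation the abstract describes. It is an illustrative approximation, not the paper's exact algorithm: the generative-model interface `sample_next_state(s, a)`, the reward function `reward(s, a)`, and all parameter names are assumptions made for the example. Each worker repeatedly picks a state, estimates the Bellman backup from generative-model samples using a possibly stale copy of the shared $\mathcal{O}(|\mathcal{S}|)$ value table, and writes back a single entry without locking.

```python
import threading

import numpy as np


# Illustrative sketch only: asynchronous, sample-based Q-value iteration with a
# shared O(|S|) value table. `sample_next_state(s, a)` and `reward(s, a)` are
# assumed interfaces to a generative model; they are not from the paper.
def async_qvi(num_states, num_actions, sample_next_state, reward,
              gamma=0.9, num_samples=100, total_updates=10000, num_workers=4):
    V = np.zeros(num_states)                  # shared value table, one entry per state
    policy = np.zeros(num_states, dtype=int)  # greedy action recorded for each state

    def worker():
        rng = np.random.default_rng()
        for _ in range(total_updates // num_workers):
            s = int(rng.integers(num_states))  # state whose entry this worker updates
            best_q, best_a = -np.inf, 0
            for a in range(num_actions):
                # Monte Carlo estimate of E[V(s')] from generative-model samples,
                # read from a possibly stale copy of the shared table (asynchrony).
                next_states = [sample_next_state(s, a) for _ in range(num_samples)]
                q = reward(s, a) + gamma * V[next_states].mean()
                if q > best_q:
                    best_q, best_a = q, a
            V[s] = best_q          # lock-free write of a single entry
            policy[s] = best_a

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return V, policy
```

The lock-free, per-entry writes are what make the scheme asynchronous-parallel: workers never wait for one another, at the cost of occasionally reading stale values.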
