Paper Information - Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) when its transition function can be accessed only through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $\mathcal{S}$, actions $\mathcal{A}$, discount factor $\gamma\in(0,1)$, and rewards in the range $[0, 1]$, we provide an algorithm that computes an $\epsilon$-optimal policy with probability $1 - \delta$, where both the running time and the number of samples taken are upper bounded by \[ O\left[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right] ~. \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
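To get a feel for how the bound above scales, the following minimal sketch evaluates the expression inside the $O[\cdot]$ for concrete parameter values. It is not the paper's algorithm: the function name `sample_bound` and its parameter names are illustrative assumptions, and the constant hidden by the big-O notation is arbitrarily taken to be 1, so the result is useful only for comparing relative growth.

```python
import math


def sample_bound(num_states, num_actions, gamma, epsilon, delta):
    """Evaluate the expression inside the O[...] bound from the abstract.

    Assumption: the hidden big-O constant is set to 1, so the returned
    number shows scaling behavior only, not an actual sample count.
    """
    sa = num_states * num_actions          # |S||A|
    horizon = 1.0 / (1.0 - gamma)          # effective horizon 1/(1-gamma)
    # Leading polynomial term: |S||A| / ((1-gamma)^3 * epsilon^2)
    leading = sa * horizon**3 / epsilon**2
    # First log factor: log(|S||A| / ((1-gamma) * delta * epsilon))
    log_confidence = math.log(sa * horizon / (delta * epsilon))
    # Second log factor: log(1 / ((1-gamma) * epsilon))
    log_accuracy = math.log(horizon / epsilon)
    return leading * log_confidence * log_accuracy


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    for gamma in (0.9, 0.99):
        value = sample_bound(10, 5, gamma, epsilon=0.1, delta=0.05)
        print(f"gamma={gamma}: bound expression = {value:.3e}")
```

Moving $\gamma$ from 0.9 to 0.99 multiplies the effective horizon $1/(1-\gamma)$ by ten, so the leading term alone grows by a factor of about a thousand, which is why the dependence on $(1-\gamma)$ dominates in practice.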

[1] Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.

[2] Yin Tat Lee, et al. Efficient Inverse Maintenance and Faster Algorithms for Linear Programming, 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[3] R. Bellman, et al. Dynamic Programming and Markov Processes, 1960.

[4] Yinyu Ye, et al. A New Complexity Result on Solving the Markov Decision Problem, 2005, Math. Oper. Res.

[5] Dimitri P. Bertsekas, et al. Abstract Dynamic Programming, 2013.

[6] Yishay Mansour, et al. On the Complexity of Policy Iteration, 1999, UAI.

[7] Hilbert J. Kappen, et al. On the Sample Complexity of Reinforcement Learning with a Generative Model, 2012, ICML.

[8] Lihong Li, et al. Reinforcement Learning in Finite MDPs: PAC Analysis, 2009, J. Mach. Learn. Res.

[9] Yin Tat Lee, et al. Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow, 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[10] Tor Lattimore, et al. PAC Bounds for Discounted MDPs, 2012, ALT.

[11] George B. Dantzig, et al. Linear programming and extensions, 1965.

[12] Vivek S. Borkar, et al. Empirical Q-Value Iteration, 2014, Stochastic Systems.

[13] Xian Wu, et al. Variance reduced value iteration and faster algorithms for solving Markov decision processes, 2017, SODA.

[14] Leslie Pack Kaelbling, et al. On the Complexity of Solving Markov Decision Problems, 1995, UAI.

[15] Peter Bro Miltersen, et al. Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor, 2010, JACM.

[16] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.

[17] Bruno Scherrer, et al. Improved and Generalized Upper Bounds on the Complexity of Policy Iteration, 2013, Math. Oper. Res.

[18] P. Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H), 1990.

[19] Andrew W. Moore, et al. Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems, 1999, IJCAI.

[20] Mengdi Wang, et al. Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Running Time, 2017, ArXiv.

[21] Michael Kearns, et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms, 1998, NIPS.

[22] Sean R Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.

[23] Yinyu Ye, et al. The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate, 2011, Math. Oper. Res.