Randomized Linear Programming Solves the Discounted Markov Decision Problem in Nearly-Linear (Sometimes Sublinear) Running Time

We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it finds an $\epsilon$-optimal policy in nearly linear running time in the worst case. When the Markov decision process is ergodic and specified in certain special data formats, the algorithm finds an $\epsilon$-optimal policy in running time linear in the total number of state-action pairs, which is sublinear in the input size. These results provide a new avenue and complexity benchmarks for solving stochastic dynamic programs.
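
The two ingredients named in the abstract, a binary tree supporting logarithmic-time weighted sampling and exponentiated primal-dual updates on the LP saddle point, can be sketched compactly. The Python sketch below is illustrative only: the names `SumTree` and `randomized_pd_mdp`, the fixed step sizes `alpha` and `beta`, the bare multiplicative dual step, and reading the policy off the final (rather than averaged) dual weights are all simplifying assumptions of ours, not the paper's exact algorithm, whose carefully tuned and truncated updates are what yield the stated complexity guarantees.

```python
import numpy as np


class SumTree:
    """Binary tree over n nonnegative weights: O(log n) point updates and
    O(log n) sampling of an index with probability proportional to its weight.
    This is the kind of structure the abstract invokes for adaptive sampling."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = np.zeros(2 * self.n)
        self.tree[self.n:] = weights
        for i in range(self.n - 1, 0, -1):      # build internal sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, w):
        i += self.n
        self.tree[i] = w
        while i > 1:                            # refresh sums on the root path
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, rng):
        u = rng.random() * self.tree[1]         # tree[1] holds the total weight
        i = 1
        while i < self.n:                       # descend toward the chosen leaf
            if u < self.tree[2 * i]:
                i = 2 * i
            else:
                u -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.n


def randomized_pd_mdp(P, r, q, gamma, T, alpha=0.01, beta=0.01, seed=0):
    """Simplified sampled primal-dual sketch for the discounted-MDP linear program
        min_v max_{mu in simplex} (1 - gamma) q^T v
            + sum_{s,a} mu_{s,a} (r_{s,a} + gamma (P v)_{s,a} - v_s),
    with P: (S, A, S) transitions, r: (S, A) rewards in [0, 1], q: (S,) initial
    distribution. Step sizes and the bare dual step are illustrative choices."""
    S, A = r.shape
    rng = np.random.default_rng(seed)
    v = np.zeros(S)
    v_max = 1.0 / (1.0 - gamma)                 # value bound when r lies in [0, 1]
    tree = SumTree(np.full(S * A, 1.0 / (S * A)))  # uniform dual iterate

    for t in range(T):
        idx = tree.sample(rng)                  # (s, a) ~ mu in O(log SA)
        s, a = divmod(idx, A)
        s_next = rng.choice(S, p=P[s, a])       # sampled transition s' ~ P(.|s, a)
        s0 = rng.choice(S, p=q)                 # sampled initial state for the q-term

        # Primal: projected stochastic gradient step on the value vector v.
        g = np.zeros(S)
        g[s0] += 1.0 - gamma
        g[s_next] += gamma
        g[s] -= 1.0
        v = np.clip(v - alpha * g, 0.0, v_max)

        # Dual: exponentiated step on the sampled coordinate; the sum tree keeps
        # sampling cheap even though the weights are left unnormalized.
        delta = r[s, a] + gamma * v[s_next] - v[s]
        tree.update(idx, tree.tree[tree.n + idx] * np.exp(beta * delta))

        if (t + 1) % 1000 == 0:                 # occasional rescale, numerics only
            tree = SumTree(tree.tree[tree.n:] / tree.tree[1])

    mu = tree.tree[tree.n:].reshape(S, A)       # final (not averaged) dual weights
    return v, mu / mu.sum(axis=1, keepdims=True)  # policy pi(a|s) proportional to mu


if __name__ == "__main__":
    # Tiny random MDP, just to exercise the interface.
    S, A = 5, 3
    rng = np.random.default_rng(1)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((S, A))
    v, policy = randomized_pd_mdp(P, r, np.full(S, 1.0 / S), gamma=0.9, T=20000)
    print(policy.round(2))
```

Extracting the policy from the final tree weights and rescaling every 1000 steps are pragmatic shortcuts for a sketch; an analysis-faithful implementation would average the dual iterates and control the dual step more carefully.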
