Solving Ergodic Markov Decision Processes and Perfect Information Zero-sum Stochastic Games by Variance Reduced Deflated Value Iteration

Recently, Sidford, Wang, Wu and Ye (2018) developed an algorithm combining variance reduction techniques with value iteration to solve discounted Markov decision processes. This algorithm has sublinear complexity when the discount factor is fixed. Here, we extend this approach to mean-payoff problems, covering both Markov decision processes and perfect information zero-sum stochastic games. We obtain sublinear complexity bounds, assuming there is a distinguished state that is accessible from all initial states and under all policies. Our method is based on a reduction from the mean-payoff problem to the discounted problem by a Doob h-transform, combined with a deflation technique. The complexity analysis of this algorithm combines the techniques developed by Sidford et al. in the discounted case with techniques from non-linear spectral theory (the Collatz-Wielandt characterization of the eigenvalue).
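The deflation idea can be illustrated by classical relative value iteration, in which the value at a distinguished state is subtracted at each step so that the iterates stay bounded and the subtracted scalar converges to the mean payoff. The sketch below runs this scheme on a small hypothetical MDP with exact transition matrices; it is an illustration of the deflation step only, not the authors' variance-reduced algorithm (which works with sampled transitions under a generative model).

```python
import numpy as np

# Toy ergodic MDP (hypothetical, for illustration): 2 states, 2 actions.
# P[a] is the transition matrix under action a; r[a, s] is the reward
# for playing action a in state s.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman(v):
    """Mean-payoff Bellman operator T(v) = max_a (r_a + P_a v)."""
    return np.max(r + P @ v, axis=0)

# Relative (deflated) value iteration: subtract T(v)[s*] at a distinguished
# state s* at every step. Since every transition matrix here has strictly
# positive entries, T contracts in the span seminorm, so the iteration
# converges: rho tends to the mean payoff (the nonlinear eigenvalue) and
# v to a bias vector normalized by v[s*] = 0.
s_star = 0
v = np.zeros(2)
for _ in range(500):
    Tv = bellman(v)
    rho = Tv[s_star]   # running estimate of the mean payoff
    v = Tv - rho       # deflation keeps the iterates bounded

print("mean payoff estimate:", rho)
print("bias vector:", v)
```

At a fixed point, v satisfies the ergodic eigenproblem T(v) = rho e + v, which is exactly the equation the deflated iteration targets; the variance-reduced method replaces the exact applications of T by sampled, variance-reduced estimates.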

[1] S. Gaubert et al. Policy iteration for perfect information stochastic mean payoff games with bounded first return times is strongly polynomial. 2013. arXiv:1310.4953.

[2] Yu. S. Ledyaev et al. Nonsmooth Analysis and Control Theory. 1998.

[3] S. Lippman et al. Stochastic Games with Perfect Information and Time Average Payoff. 1969.

[4] S. Gaubert et al. A Collatz-Wielandt characterization of the spectral radius of order-preserving homogeneous maps on cones. 2011. arXiv:1112.5968.

[5] Mengdi Wang et al. Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems. 2017. arXiv.

[6] Peter W. Glynn et al. An empirical algorithm for relative value iteration for average-cost MDPs. In 54th IEEE Conference on Decision and Control (CDC), 2015.

[7] Xian Wu et al. Variance reduced value iteration and faster algorithms for solving Markov decision processes. SODA, 2017.

[8] Lin F. Yang et al. Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model. 2018. arXiv:1806.01492.

[9] L. Shapley et al. Stochastic Games. Proceedings of the National Academy of Sciences, 1953.

[10] John Mallet-Paret et al. Eigenvalues for a class of homogeneous cone maps arising from max-plus operators. 2002.

[11] Sylvain Sorin et al. Stochastic Games and Applications. 2003.

[12] John N. Tsitsiklis et al. An Analysis of Stochastic Shortest Path Problems. Mathematics of Operations Research, 1991.

[13] Peter Whittle. Optimization Over Time. 1982.

[14] S. Gaubert et al. Generic uniqueness of the bias vector of finite stochastic games with perfect information. 2016. arXiv:1610.09651.

[15] E. Dynkin. Boundary theory of Markov processes (the discrete case). 1969.

[16] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1994.