A First-Order Approach to Accelerated Value Iteration

Markov decision processes (MDPs) are used to model stochastic systems in many applications. Several efficient algorithms for computing optimal policies have been studied in the literature, including value iteration (VI), policy iteration, and LP-based algorithms. However, these algorithms do not scale well, especially as the discount factor $\lambda$ for the infinite-horizon discounted reward approaches one: their running time scales as $1/(1-\lambda)$. Our main contribution in this paper is a set of algorithms for policy computation that significantly outperform current approaches. In particular, we present a connection between VI and gradient descent and adapt the ideas of acceleration and momentum from convex optimization to design faster algorithms for MDPs. We prove theoretical guarantees of faster convergence of our algorithms for the computation of the value vector of a fixed policy: the running time scales as $1/\sqrt{1-\lambda}$, compared to $1/(1-\lambda)$ for current approaches, an improvement directly analogous to Nesterov's acceleration and momentum in convex optimization. While these theoretical guarantees do not extend to optimal policy computation, our algorithms exhibit strong empirical performance, providing significant speedups (up to one order of magnitude in many cases) on a large testbed of MDP instances. Finally, we provide a lower bound on the convergence of any first-order algorithm for solving MDPs, exhibiting a family of MDP instances on which no such algorithm can converge faster than VI when the number of iterations is smaller than the number of states.
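To make the VI/gradient-descent connection concrete, the sketch below compares policy evaluation by standard value iteration with a momentum-accelerated variant in which the Bellman operator of a fixed policy is applied at a Nesterov-style extrapolated point. This is an illustrative assumption, not the paper's exact scheme: the function names `evaluate_policy_vi` and `evaluate_policy_momentum` are hypothetical, and the default momentum coefficient $\beta=(1-\sqrt{1-\lambda})/(1+\sqrt{1-\lambda})$ is borrowed by analogy from Nesterov's schedule for strongly convex objectives.

```python
import numpy as np

def evaluate_policy_vi(P, r, lam, tol=1e-8, max_iter=200_000):
    """Standard value iteration for a fixed policy: v_{k+1} = r + lam * P @ v_k."""
    v = np.zeros_like(r)
    for _ in range(max_iter):
        v_next = r + lam * P @ v
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

def evaluate_policy_momentum(P, r, lam, beta=None, tol=1e-8, max_iter=200_000):
    """Illustrative accelerated variant: apply the Bellman operator at the
    extrapolated point h_k = v_k + beta * (v_k - v_{k-1})."""
    if beta is None:
        # Assumed momentum schedule, by analogy with Nesterov's method for
        # strongly convex problems; not necessarily the paper's choice.
        beta = (1 - np.sqrt(1 - lam)) / (1 + np.sqrt(1 - lam))
    v_prev = np.zeros_like(r)
    v = r.copy()
    for _ in range(max_iter):
        h = v + beta * (v - v_prev)      # momentum / extrapolation step
        v_next = r + lam * P @ h         # Bellman update at the extrapolated point
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v_prev, v = v, v_next
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, lam = 50, 0.99
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)    # row-stochastic transition matrix of the policy
    r = rng.random(n)
    v_exact = np.linalg.solve(np.eye(n) - lam * P, r)
    print(np.max(np.abs(evaluate_policy_vi(P, r, lam) - v_exact)))
    print(np.max(np.abs(evaluate_policy_momentum(P, r, lam) - v_exact)))
```

For this illustrative linear recurrence with the assumed $\beta$, the slowest mode contracts at roughly $1-\sqrt{1-\lambda}$ per iteration (about $0.9$ for $\lambda=0.99$), whereas plain VI contracts at rate $\lambda$; this mirrors the $1/\sqrt{1-\lambda}$ versus $1/(1-\lambda)$ scaling described above.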
