Q-learning and policy iteration algorithms for stochastic shortest path problems

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to those introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach lies in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly, with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total-cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and they provide policy-iteration-like alternative Q-learning schemes whose convergence is as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.
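To make the policy-evaluation-as-optimal-stopping idea concrete, here is a minimal synchronous lookup-table sketch in Python. It is a sketch under stated assumptions, not the paper's algorithm verbatim: P[u] is assumed to be an n-by-n substochastic transition matrix for control u (the missing row mass is the probability of moving to the cost-free termination state), g[i, u] is the expected one-stage cost, and the name q_pi_ssp and the fixed iteration counts are illustrative choices; the asynchronous deterministic and stochastic (simulation-based) variants the paper actually analyzes are omitted. Each cycle freezes the stopping values nu(j) = min_v Q(j, v) and the policy mu, then repeatedly applies Q(i, u) <- g(i, u) + sum_j p_ij(u) min{nu(j), Q(j, mu(j))}, which involves only a two-way minimum rather than a minimization over all controls.

```python
import numpy as np

def q_pi_ssp(P, g, num_cycles=50, evals_per_cycle=5):
    """Hypothetical sketch of combined Q-learning/policy iteration for SSP.

    P: list of n-by-n substochastic matrices, one per control; missing row
       mass is the probability of terminating (zero cost thereafter).
    g: (n, m) array of expected one-stage costs g[i, u].
    """
    n, m = g.shape
    Q = np.zeros((n, m))            # Q-factor estimates
    mu = np.zeros(n, dtype=int)     # current policy
    for _ in range(num_cycles):
        # Policy improvement: the only step minimizing over all controls.
        mu = Q.argmin(axis=1)
        stop = Q.min(axis=1)        # frozen stopping values nu(j)
        # Inexact policy evaluation: a few value iterations for the
        # optimal stopping problem defined by (mu, stop).
        for _ in range(evals_per_cycle):
            w = np.minimum(stop, Q[np.arange(n), mu])  # min{nu(j), Q(j, mu(j))}
            Q = g + np.stack([P[u] @ w for u in range(m)], axis=1)
    return Q, mu
```

The parameters evals_per_cycle and num_cycles trade evaluation accuracy against per-iteration overhead, in the spirit of modified policy iteration; the convergence results in the paper cover the harder asynchronous and stochastic-sampling versions of these updates under the usual stochastic shortest path assumptions (a proper policy exists and improper policies have infinite cost).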

[1] J. H. Eaton and L. A. Zadeh, Optimal Pursuit Strategies in Discrete-State Probabilistic Systems, 1962.

[2] A. F. Veinott, Discrete Dynamic Programming with Sensitive Discount Optimality Criteria, 1969.

[3] C. Derman, Finite State Markovian Decision Processes, 1970.

[4] G. M. Baudet, Asynchronous Iterative Methods for Multiprocessors, J. ACM, 1978.

[5] P. Whittle, Optimization Over Time, 1982.

[6] D. P. Bertsekas, Distributed Asynchronous Computation of Fixed Points, Math. Program., 1983.

[7] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 1989.

[8] D. P. Bertsekas and J. N. Tsitsiklis, An Analysis of Stochastic Shortest Path Problems, Math. Oper. Res., 1991.

[9] E. A. Feinberg, On Stationary Strategies in Borel Dynamic Programming, Math. Oper. Res., 1992.

[10] R. J. Williams and L. C. Baird, Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems, 1993.

[11] T. Jaakkola, M. I. Jordan, and S. P. Singh, On the Convergence of Stochastic Iterative Dynamic Programming Algorithms, 1994.

[12] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[13] T. Jaakkola, S. P. Singh, and M. I. Jordan, Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, NIPS, 1994.

[14] G. J. Gordon, Stable Function Approximation in Dynamic Programming, ICML, 1995.

[15] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, University of Cambridge, 1989.

[16] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vols. I and II, 1995.

[17] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996.

[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 1998.

[19] V. S. Borkar, Asynchronous Stochastic Approximations, SIAM J. Control Optim., 1998.

[20] J. N. Tsitsiklis and B. Van Roy, Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives, IEEE Trans. Autom. Control, 1999.

[22] P. Langley, Editorial: On Machine Learning, Machine Learning, 1986.

[23] J. N. Tsitsiklis, Asynchronous Stochastic Approximation and Q-Learning, Machine Learning, 1994.

[24] J. N. Tsitsiklis and B. Van Roy, Feature-Based Methods for Large Scale Dynamic Programming, Machine Learning, 1996.

[25] D. S. Choi and B. Van Roy, A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, Discrete Event Dyn. Syst., 2006.

[26] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2007.

[27] H. Yu and D. P. Bertsekas, A Least Squares Q-Learning Algorithm for Optimal Stopping Problems, 2007.

[28] H. Yu and D. P. Bertsekas, Q-Learning Algorithms for Optimal Stopping Based on Least Squares, Proc. European Control Conference (ECC), 2007.

[30] D. P. Bertsekas and H. Yu, Projected Equation Methods for Approximate Solution of Large Linear Systems, J. Comput. Appl. Math., 2009.

[31] D. P. Bertsekas and H. Yu, Distributed Asynchronous Policy Iteration in Dynamic Programming, Proc. 48th Allerton Conference on Communication, Control, and Computing, 2010.

[32] C. Thiery and B. Scherrer, Least-Squares Policy Iteration: Bias-Variance Trade-off in Control Problems, ICML, 2010.

[33] B. Scherrer et al., Performance Bound for Approximate Optimistic Policy Iteration, 2010.

[34] D. P. Bertsekas and H. Yu, Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming, Proc. 49th IEEE Conference on Decision and Control (CDC), 2010.

[35] D. P. Bertsekas, Approximate Dynamic Programming, Encyclopedia of Machine Learning and Data Mining, 2017.

[36] H. Yu and D. P. Bertsekas, On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems, Math. Oper. Res., 2013.

[37] U. G. Rothblum et al., (Approximate) Iterated Successive Approximations Algorithm for Sequential Decision Processes, Ann. Oper. Res., 2013.

[38] D. P. Bertsekas, Distributed Dynamic Programming, IEEE Trans. Autom. Control, 1982.