LIDS REPORT 2871
Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems∗

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.

Sep 2011; revised Mar 2012

∗ Work supported by the Air Force Grant FA9550-10-1-0412 and by NSF Grant ECCS-0801549.
† Huizhen Yu is with the Lab. for Information and Decision Systems, M.I.T., Cambridge, Mass., 02139. janey yu@mit.edu
‡ Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, and the Lab. for Information and Decision Systems, M.I.T., Cambridge, Mass., 02139. dimitrib@mit.edu
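To illustrate the structure described above, the following is a minimal synchronous, model-based sketch of the policy-iteration-like Q-learning idea for a tabular stochastic shortest path problem. The array names P and g, the function name pi_q_learning_ssp, and the synchronous sweeps over all state-control pairs are illustrative assumptions on our part; the report itself analyzes asynchronous deterministic and stochastic (sampling-based) lookup table implementations.

```python
import numpy as np

def pi_q_learning_ssp(P, g, n_states, n_controls, num_cycles=100, eval_iters=5):
    """Sketch of policy-iteration-like Q-learning for a tabular SSP (model known).

    P[u][i][j] : transition probability from state i to state j under control u
                 (the termination state is absorbing and cost-free, so it is
                 simply omitted from these arrays)
    g[u][i]    : expected one-stage cost of applying control u at state i
    """
    Q = np.zeros((n_states, n_controls))
    J = np.zeros(n_states)              # J(i) = min_u Q(i, u) from last improvement
    mu = np.zeros(n_states, dtype=int)  # current policy

    for _ in range(num_cycles):
        # Policy evaluation phase: instead of solving the linear system for Q_mu,
        # apply a finite number of value-iteration-like updates for the associated
        # optimal stopping problem, where "stopping" at state j yields the current
        # estimate J(j) and "continuing" yields Q(j, mu(j)).
        for _ in range(eval_iters):
            stop_or_continue = np.minimum(J, Q[np.arange(n_states), mu])
            for u in range(n_controls):
                Q[:, u] = g[u] + P[u] @ stop_or_continue

        # Policy improvement phase: the only step that requires minimization
        # over all controls, in the spirit of modified policy iteration.
        J = Q.min(axis=1)
        mu = Q.argmin(axis=1)

    return Q, J, mu
```

In this sketch the cheap evaluation updates dominate the computation, and the full minimization over controls occurs only once per cycle, which is the lower-overhead property the abstract emphasizes relative to classical Q-learning/value iteration.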
