Q-learning algorithms for optimal stopping based on least squares

We consider the solution of discounted optimal stopping problems using linear function approximation methods. A Q-learning algorithm for such problems, proposed by Tsitsiklis and Van Roy, is based on the method of temporal differences and stochastic approximation. We propose alternative algorithms, which are based on projected value iteration ideas and least squares. We prove the convergence of some of these algorithms and discuss their properties.
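The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea of combining projected value iteration with least squares. It assumes a finite-state Markov chain, a cost-minimization formulation, and a batch fit over one fixed simulated trajectory rather than a recursive per-sample update. The names `simulate_chain`, `ls_q_stopping`, `Phi`, `g`, and `c` are hypothetical and chosen for illustration. Each pass regresses the one-step cost plus the discounted minimum of the stopping cost and the current approximate continuation cost onto the features.

```python
import numpy as np

def simulate_chain(P, n_steps, rng, x0=0):
    """Simulate a trajectory x_0, ..., x_{n_steps} of a finite Markov chain."""
    n = P.shape[0]
    path = np.empty(n_steps + 1, dtype=int)
    path[0] = x0
    for k in range(n_steps):
        path[k + 1] = rng.choice(n, p=P[path[k]])
    return path

def ls_q_stopping(path, Phi, g, c, alpha, n_iters=200):
    """Batch least-squares iteration for continuation costs Q ~ Phi @ r.

    path  : simulated state indices
    Phi   : (n_states, d) feature matrix, row i holds the features of state i
    g     : (n_states,) assumed one-step cost of continuing from each state
    c     : (n_states,) assumed cost of stopping in each state
    alpha : discount factor in (0, 1)
    """
    cur, nxt = path[:-1], path[1:]
    A = Phi[cur]  # regression design matrix, one row per sampled transition
    r = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        # Target: continuation cost plus the discounted cheaper option
        # (stop or continue) at the next sampled state, using the current r.
        target = g[cur] + alpha * np.minimum(c[nxt], Phi[nxt] @ r)
        # Least-squares fit plays the role of the projection step.
        r, *_ = np.linalg.lstsq(A, target, rcond=None)
    return r

# Illustrative usage on a small random chain (all numbers are made up).
rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)      # normalize rows into probabilities
g = rng.random(n)
c = 5 * rng.random(n)
s = np.arange(n) / (n - 1)
Phi = np.column_stack([np.ones(n), s, s**2])  # quadratic features
path = simulate_chain(P, 5000, rng)
r = ls_q_stopping(path, Phi, g, c, alpha=0.9)
stop_here = c <= Phi @ r               # stop where stopping is no costlier
```

The resulting policy stops in any state whose stopping cost does not exceed the fitted continuation cost. A recursive variant would update `r` once per new sample instead of refitting over the whole trajectory.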

[1]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Vol. II , 1976 .

[2]  C. Watkins Learning from delayed rewards , 1989 .

[3]  John N. Tsitsiklis,et al.  Asynchronous stochastic approximation and Q-learning , 1994, Mach. Learn.

[4]  Jérôme Barraquand,et al.  Numerical Valuation of High Dimensional Multivariate American Securities , 1995, Journal of Financial and Quantitative Analysis.

[5]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[6]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[7]  S. Ioffe,et al.  Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming , 1996 .

[8]  John N. Tsitsiklis,et al.  Analysis of temporal-difference learning with function approximation , 1996, NIPS 1996.

[9]  Andrew G. Barto,et al.  Linear Least-Squares Algorithms for Temporal Difference Learning , 2005, Machine Learning.

[10]  Dimitri P. Bertsekas,et al.  Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming , 1997 .

[11]  John N. Tsitsiklis,et al.  Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives , 1999, IEEE Trans. Autom. Control.

[12]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[13]  Francis A. Longstaff,et al.  Valuing American Options by Simulation: A Simple Least-Squares Approach , 2001 .

[14]  Dimitri P. Bertsekas,et al.  Least Squares Policy Evaluation Algorithms with Linear Function Approximation , 2003, Discret. Event Dyn. Syst.

[15]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[16]  A. Barto,et al.  Improved Temporal Difference Methods with Linear Function Approximation , 2004 .

[17]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[18]  David Choi,et al.  A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning , 2001, Discret. Event Dyn. Syst.

[19]  D. Bertsekas,et al.  A Least Squares Q-Learning Algorithm for Optimal Stopping Problems , 2007 .

[20]  Bayu Jayawardhana,et al.  European Control Conference 2007 , 2007 .

[21]  Dimitri P. Bertsekas,et al.  Convergence Results for Some Temporal Difference Methods Based on Least Squares , 2009, IEEE Transactions on Automatic Control.