Basis function adaptation methods for cost approximation in MDP

We generalize a basis adaptation method for cost approximation in Markov decision processes (MDP), extending earlier work of Menache, Mannor, and Shimkin. In our context, basis functions are parametrized and their parameters are tuned by minimizing an objective function involving the cost function approximation obtained when a temporal differences (TD) or other method is used. The adaptation scheme involves only low order calculations and can be implemented in a way analogous to policy gradient methods. In the generalized basis adaptation framework we provide extensions to TD methods for nonlinear optimal stopping problems and to alternative cost approximations beyond those based on TD.

[1]  Stephen M. Robinson,et al.  An Implicit-Function Theorem for a Class of Nonsmooth Functions , 1991, Math. Oper. Res..

[2]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[3]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[4]  Andrew G. Barto,et al.  Reinforcement learning , 1998 .

[5]  R. Tyrrell Rockafellar,et al.  Variational Analysis , 1998, Grundlehren der mathematischen Wissenschaften.

[6]  Richard S. Sutton,et al.  Dimensions of Reinforcement Learning , 1998 .

[7]  John N. Tsitsiklis,et al.  Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives , 1999, IEEE Trans. Autom. Control..

[8]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[9]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[10]  Vijay R. Konda,et al.  OnActor-Critic Algorithms , 2003, SIAM J. Control. Optim..

[11]  A. Barto,et al.  Improved Temporal Difference Methods with Linear Function Approximation , 2004 .

[12]  Shie Mannor,et al.  Basis Function Adaptation in Temporal Difference Reinforcement Learning , 2005, Ann. Oper. Res..

[13]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[14]  David Choi,et al.  A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning , 2001, Discret. Event Dyn. Syst..

[15]  Vivek S. Borkar,et al.  Stochastic approximation with 'controlled Markov' noise , 2006, Systems & control letters (Print).

[16]  D. Bertsekas,et al.  A Least Squares Q-Learning Algorithm for Optimal Stopping Problems , 2007 .

[17]  D. Bertsekas,et al.  Journal of Computational and Applied Mathematics Projected Equation Methods for Approximate Solution of Large Linear Systems , 2022 .

[18]  R. Tyrrell Rockafellar,et al.  Robinson’s implicit function theorem and its extensions , 2009, Math. Program..