Performance Loss Bounds for Approximate Value Iteration with State Aggregation

We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits of using invariant distributions of appropriate policies as projection weights; such projection weighting is related to the weighting used by temporal-difference learning. Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective.
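To make the setup concrete, the sketch below (not taken from the paper; all names, array shapes, and the discounted-cost formulation are assumptions of this illustration) runs approximate value iteration with a piecewise-constant approximator: after each Bellman update, the iterate is projected back onto the aggregation architecture by a weighted average within each partition, with the weight vector standing in for the invariant-distribution projection weights discussed above.

    import numpy as np

    def aggregated_value_iteration(P, g, alpha, phi, weights,
                                   num_iters=1000, tol=1e-8):
        # Hypothetical sketch of projected value iteration with state aggregation.
        # P[a, s, s2] : transition probability from s to s2 under action a
        # g[s, a]     : one-stage cost
        # alpha       : discount factor in (0, 1)
        # phi[s]      : integer index of the partition containing state s
        # weights[s]  : nonnegative projection weights (e.g., an invariant
        #               distribution of some policy), positive on each partition
        A, S, _ = P.shape
        K = int(phi.max()) + 1
        r = np.zeros(K)                      # one constant per partition
        for _ in range(num_iters):
            J = r[phi]                       # piecewise-constant value estimate
            EV = P @ J                       # shape (A, S): E[J(s') | s, a]
            Q = g + alpha * EV.T             # shape (S, A)
            TJ = Q.min(axis=1)               # Bellman update at every state
            # Weighted projection onto piecewise-constant functions:
            num = np.bincount(phi, weights=weights * TJ, minlength=K)
            den = np.bincount(phi, weights=weights, minlength=K)
            r_new = num / den
            if np.max(np.abs(r_new - r)) < tol:
                r = r_new
                break
            r = r_new
        # Greedy (cost-minimizing) policy with respect to the fixed-point approximation.
        J = r[phi]
        policy = (g + alpha * (P @ J).T).argmin(axis=1)
        return r, policy

The choice of weights in the projection step is the point of interest: weighting each partition's average by an invariant distribution of an appropriate policy, rather than uniformly, is what the paper's bounds suggest is advantageous and is the sense in which the scheme resembles temporal-difference learning.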
