The loss from imperfect value functions in expectation-based and minimax-based tasks

Many reinforcement learning (RL) algorithms approximate an optimal value function. Once this function is known, an optimal policy is easy to derive. In most real-world applications, however, the value function is too complex to be represented by lookup tables, making it necessary to use function approximators such as neural networks. In this case, convergence to the optimal value function is no longer guaranteed, and it becomes important to know to what extent performance degrades when approximate value functions are used in place of optimal ones. This problem has recently been discussed in the context of expectation-based Markov decision problems. Our analysis generalizes this work to minimax-based Markov decision problems, yields new results for expectation-based tasks, and shows how minimax-based and expectation-based Markov decision problems relate.
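For concreteness, the known bound for expectation-based discounted problems (the result of Singh and Yee, 1994, restated here in our own notation as a sketch) takes the following form: if $V$ is an approximate value function with $\|V - V^*\|_\infty \le \varepsilon$ and $\pi$ is a policy that is greedy with respect to $V$, then

\[
  \|V^{\pi} - V^*\|_\infty \;\le\; \frac{2\gamma\varepsilon}{1-\gamma},
\]

where $V^*$ denotes the optimal value function, $V^{\pi}$ the value function of the greedy policy, and $\gamma$ the discount factor. The loss therefore scales linearly with the approximation error $\varepsilon$ but can become large as $\gamma$ approaches 1; the minimax-based setting asks for analogous guarantees when the expectation over transitions is replaced by a worst-case criterion.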
