Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding

On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes ("rollouts"), as in classical Monte Carlo methods, and as in the TD(λ) algorithm when λ = 1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ.
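The method the abstract summarizes combines two ingredients that are easy to sketch: sparse coarse coding (a CMAC, here in its common tile-coding form) to turn a continuous state into a sparse binary feature vector, and an online linear TD(λ)-family update with replacing eligibility traces. The sketch below is illustrative only: the `TileCoder` and `sarsa_lambda_update` names, the 8x8 grids, and the convention of dividing the step size by the number of active tiles are assumptions for illustration, not code from the paper.

```python
import numpy as np

class TileCoder:
    """Sparse coarse coding (CMAC): several offset grids ("tilings") over a
    continuous state space. Each tiling contributes exactly one active tile,
    so a state maps to a sparse binary vector with num_tilings ones."""

    def __init__(self, lows, highs, num_tilings=8, tiles_per_dim=8):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.num_tilings = num_tilings
        self.tiles_per_dim = tiles_per_dim
        self.tiles_per_tiling = tiles_per_dim ** len(lows)
        self.num_features = num_tilings * self.tiles_per_tiling

    def active_tiles(self, state):
        """Indices of the active tiles for `state`, one index per tiling."""
        scaled = (np.asarray(state, dtype=float) - self.lows) / (self.highs - self.lows)
        active = []
        for t in range(self.num_tilings):
            offset = t / self.num_tilings  # each tiling shifted by a fraction of a tile
            coords = np.floor(scaled * self.tiles_per_dim + offset).astype(int)
            coords = np.clip(coords, 0, self.tiles_per_dim - 1)
            flat = 0
            for c in coords:               # row-major flattening of grid coordinates
                flat = flat * self.tiles_per_dim + int(c)
            active.append(t * self.tiles_per_tiling + flat)
        return active


def sarsa_lambda_update(w, z, tiles, tiles_next, reward,
                        alpha=0.1, gamma=1.0, lam=0.9, terminal=False):
    """One linear Sarsa(lambda) step with replacing eligibility traces.
    `w` is the weight vector, `z` the trace vector; `tiles`/`tiles_next`
    are the active CMAC tiles for the current and next state-action pair.
    For brevity this sketch ties features to states only; a full agent
    would keep per-action weights or fold the action into the tile index."""
    q = w[tiles].sum()
    q_next = 0.0 if terminal else w[tiles_next].sum()
    delta = reward + gamma * q_next - q      # TD error
    z *= gamma * lam                         # decay all traces
    z[tiles] = 1.0                           # replacing traces: set, don't accumulate
    w += (alpha / len(tiles)) * delta * z    # step size scaled per active tile
    return w, z


# Usage sketch: one learning step on a 2-D state in [0, 1]^2.
coder = TileCoder(lows=[0.0, 0.0], highs=[1.0, 1.0])
w = np.zeros(coder.num_features)
z = np.zeros(coder.num_features)
w, z = sarsa_lambda_update(
    w, z,
    coder.active_tiles([0.30, 0.70]),   # tiles for (s, a)
    coder.active_tiles([0.32, 0.68]),   # tiles for (s', a')
    reward=-1.0,
)
```

Consistent with the abstract's conclusion, such an update would be run with an intermediate λ rather than λ = 1: setting λ = 1 makes the update target equivalent to an actual-outcome (Monte Carlo, "rollout") return, which the authors report performed substantially worse in their experiments.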

[1] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.

[2] James S. Albus. Brains, Behavior, and Robotics. BYTE Books, 1981.

[3] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984.

[4] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[5] Mark W. Spong and M. Vidyasagar. Robot Dynamics and Control. Wiley, 1989.

[6] W. T. Miller et al. CMAC: An associative neural network alternative to backpropagation. Proceedings of the IEEE, 1990.

[7] Chun-Shin Lin and Hyongsuk Kim. CMAC-based adaptive critic self-learning control. IEEE Transactions on Neural Networks, 1991.

[8] Thomas Dean et al. Reinforcement Learning for Planning and Control. 1993.

[9] Richard S. Sutton and Steven D. Whitehead. Online Learning with Random Representations. ICML, 1993.

[10] Justin A. Boyan and Andrew W. Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. NIPS, 1994.

[11] Mark W. Spong et al. Swinging up the Acrobot: An example of intelligent control. Proceedings of the 1994 American Control Conference (ACC '94), 1994.

[12] Chen-Khong Tham et al. Modular on-line function approximation for scaling up reinforcement learning. 1994.

[13] G. A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, 1994.

[14] Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming. ICML, 1995.

[15] Dimitri P. Bertsekas. A Counterexample to Temporal Differences Learning. Neural Computation, 1995.

[16] Robert H. Crites and Andrew G. Barto. Improving Elevator Performance Using Reinforcement Learning. NIPS, 1995.

[17] Wei Zhang and Thomas G. Dietterich. A Reinforcement Learning Approach to Job-Shop Scheduling. IJCAI, 1995.

[18] Leemon C. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. ICML, 1995.

[19] Satinder P. Singh and Richard S. Sutton. Reinforcement Learning with Replacing Eligibility Traces. Machine Learning, 1996.