ADVANTAGE UPDATING

A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, the number of possible outcomes from a given action, or the number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that, for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or the time steps are small, advantage updating learns faster than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results are presented for R-learning and for algorithms based upon change in value. It is proved that the learning rule for advantage updating converges to the optimal policy with probability one.
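The time-step claims above rest on a scaling observation: as the time step Δt shrinks, the Q-values of all actions in a state converge toward that state's value, so the differences that actually determine the policy become vanishingly small and are easily swamped by noise or approximation error, whereas an advantage stored on a 1/Δt scale keeps actions distinguishable. The sketch below is a minimal illustration of that scaling, not the paper's LQR experiment: it evaluates a toy one-dimensional regulator for a few time steps and prints the raw Q-value gap between the best and worst first action alongside that gap divided by Δt. The dynamics, reward, discount, action set, and constants are assumptions chosen only for illustration.

```python
# Illustrative sketch only, not the paper's experiment: all dynamics, rewards,
# and constants below are made-up assumptions chosen to expose the time-step
# scaling that motivates advantage updating.

def q_estimate(u_first, dt, gamma=0.99, horizon=10.0):
    """Discounted return from x0 = 1.0 when the first action is u_first and
    every later action greedily drives x back toward 0.
    Per step: reward = -x^2 * dt, then x <- x + u * dt."""
    x, total, discount, u = 1.0, 0.0, 1.0, u_first
    for _ in range(round(horizon / dt)):
        total += discount * (-(x ** 2)) * dt
        x += u * dt
        discount *= gamma ** dt
        u = -1.0 if x > 0 else (1.0 if x < 0 else 0.0)  # greedy follow-up policy
    return total

for dt in (1.0, 0.1, 0.01):
    # Gap between the worst (+1) and best (-1) first action at x0 = 1.0.
    gap = q_estimate(+1.0, dt) - q_estimate(-1.0, dt)
    print(f"dt={dt:5.2f}   Q-value gap={gap:9.4f}   gap/dt={gap / dt:9.4f}")
```

For small Δt the raw Q-value gap shrinks roughly in proportion to Δt while the scaled gap settles near a constant; this is the regime in which the abstract reports advantage updating outperforming Q-learning by large factors.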
