Reinforcement learning in continuous time: advantage updating

A new algorithm for reinforcement learning, advantage updating, is described. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, the number of possible outcomes from a given action, or the number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps), for which standard algorithms such as Q-learning are not applicable. Simulation results are presented indicating that, for a simple linear quadratic regulator (LQR) problem, advantage updating learns more quickly than Q-learning by a factor of 100,000 when the time step is small. Even for large time steps, advantage updating is never slower than Q-learning, and it is more resistant to noise. Convergence properties are discussed, and it is proved that the learning rule for advantage updating converges to the optimal policy with probability one.
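The following is a minimal, hedged sketch of the scaled-advantage idea the abstract describes: the action-dependent temporal-difference term is divided by the time step, so action preferences keep a fixed scale as the time step shrinks, whereas plain Q-learning targets for all actions collapse toward the state value. The discretization, learning rates, exploration scheme, and the simplified value update are illustrative assumptions, not the paper's exact learning rules (the paper also couples the value update to the advantage update and adds a normalization step driving max over actions of A(x, u) toward zero).

```python
import numpy as np

# Sketch of an advantage-updating-style learner on a discretized 1-D
# LQR-like task (assumed setup; not the experiment from the paper).

dt    = 0.01   # small time step; Q-values become nearly action-independent here
gamma = 0.99   # discount per unit time (applied as gamma**dt per step)
alpha = 0.1    # advantage learning rate (assumed)
beta  = 0.1    # value learning rate (assumed)

xs = np.linspace(-1.0, 1.0, 21)   # discretized states
us = np.linspace(-1.0, 1.0, 5)    # discretized actions

V = np.zeros(len(xs))             # state values
A = np.zeros((len(xs), len(us)))  # advantages, one per state-action pair

def nearest(grid, value):
    """Index of the grid point closest to `value`."""
    return int(np.argmin(np.abs(grid - value)))

def step(x, u):
    """One Euler step of dx/dt = u with an LQR-style quadratic cost."""
    x_next = np.clip(x + u * dt, -1.0, 1.0)
    reward = -(x**2 + u**2) * dt  # cost scaled by the time step
    return x_next, reward

rng = np.random.default_rng(0)
x = 0.5
for _ in range(50_000):
    i = nearest(xs, x)
    # Epsilon-greedy action selection over the advantages (assumed scheme).
    j = rng.integers(len(us)) if rng.random() < 0.2 else int(np.argmax(A[i]))
    u = us[j]
    x_next, r = step(x, u)
    k = nearest(xs, x_next)

    # Advantage update: the temporal-difference term is divided by dt, so
    # the action-dependent signal keeps a fixed scale as dt -> 0.
    target_A = A[i].max() + (r + gamma**dt * V[k] - V[i]) / dt
    A[i, j] += alpha * (target_A - A[i, j])

    # Simplified value update (placeholder; not the paper's coupled rule).
    V[i] += beta * (r + gamma**dt * V[k] - V[i])

    x = x_next

print("greedy actions along the state grid:", us[np.argmax(A, axis=1)])
```

The key design point illustrated is the 1/dt scaling of the advantage target: without it, as the time step shrinks, the difference between Q-values of different actions in the same state vanishes relative to noise, which is the regime where the abstract reports Q-learning degrading while advantage updating does not.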