Reinforcement Learning With High-Dimensional, Continuous Actions

Abstract: Many reinforcement learning systems, such as Q-learning (Watkins, 1989) or advantage updating (Baird, 1993), require that a function f(x,u) be learned, and that the value of argmax f(x,u) be calculated quickly for any given x. The function f could be learned by a function approximation system such as a multilayer perceptron, but the maximum of f for a given x cannot be found analytically and is difficult to approximate numerically for high-dimensional u vectors. A new method is proposed, wire fitting, in which a function approximation system is used to learn a set of functions called control wires, and the function f is found by fitting a surface to the control wires. Wire fitting has the following four properties: (1) any continuous function f can be represented to any desired accuracy given sufficient parameters; (2) the function f(x,u) can be evaluated quickly; (3) argmax f(x,u) can be found exactly in constant time after evaluating f(x,u); (4) wire fitting can incorporate any general function approximation system. These four properties are discussed, and it is shown how wire fitting can be combined with a memory-based learning system and Q-learning to control an inverted-pendulum system.
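The scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formula: it assumes an inverse-distance interpolator over the wires in which the wire with the highest value passes exactly through the surface's maximum, so the argmax reduces to picking the best wire. The smoothing constants `c` and `eps` are hypothetical parameters introduced here for illustration.

```python
import numpy as np

def wire_fit(u, wires_u, wires_q, c=1.0, eps=1e-6):
    """Interpolate a surface f(x, .) at action u from a fixed set of
    control wires for one state x.

    wires_u: (n, d) array, the action point of each wire at this state.
    wires_q: (n,)   array, the value of each wire at this state.
    The distance term is inflated for low-valued wires (c * (q_max - q_i)),
    which keeps the interpolated surface from exceeding max_i q_i.
    """
    q_max = wires_q.max()
    # Weighted inverse-distance interpolation between the wires.
    dist = np.sum((wires_u - u) ** 2, axis=1) + c * (q_max - wires_q) + eps
    w = 1.0 / dist
    return np.sum(w * wires_q) / np.sum(w)

def argmax_u(wires_u, wires_q):
    """Property (3): the maximum of the fitted surface lies on the
    highest-valued wire, so the argmax is found in constant time."""
    return wires_u[np.argmax(wires_q)]
```

For example, with three wires at actions 0, 1, and 2 carrying values 0.5, 2.0, and 1.0, `argmax_u` returns the action of the middle wire, and evaluating `wire_fit` there recovers (approximately) the maximum value 2.0; no numerical search over u is needed.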

[1] Paul J. Werbos, et al. Neural networks for control and system identification, 1989, Proceedings of the 28th IEEE Conference on Decision and Control.

[2] Vijaykumar Gullapalli, et al. A stochastic reinforcement learning algorithm for learning real-valued functions, 1990, Neural Networks.

[3] Gerald Tesauro, et al. Neurogammon: a neural-network backgammon program, 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[4] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[5] Peter J. Millington, et al. Associative reinforcement learning for optimal control, 1991.

[6] V. Gullapalli, et al. Associative reinforcement learning of real-valued functions, 1991, Conference Proceedings 1991 IEEE International Conference on Systems, Man, and Cybernetics.

[7] L. C. Baird. Function minimization for dynamic programming using connectionist networks, 1992, Proceedings 1992 IEEE International Conference on Systems, Man, and Cybernetics.

[8] James S. Morgan, et al. A Hierarchical Network of Control Systems that Learn: Modeling Nervous System Function During Classical and Instrumental Conditioning, 1993, Adapt. Behav..

[9] A. Harry Klopf, et al. A Hierarchical Network of Provably Optimal Learning Control Systems: Extensions of the Associative Control Process (ACP) Network, 1993, Adapt. Behav..

[10] A. Harry Klopf, et al. Extensions of the associative control process (ACP) network: hierarchies and provable optimality, 1993.

[11] A. Harry Klopf, et al. Modeling nervous system function with a hierarchical network of control systems that learn, 1993.

[12] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.