Reinforcement learning is direct adaptive optimal control

Neural network reinforcement learning methods are described and considered as a direct approach to adaptive optimal control of nonlinear systems. These methods have their roots in studies of animal learning and in early learning control work. An emerging, deeper understanding of these methods is summarized; it is obtained by viewing them as a synthesis of dynamic programming and stochastic approximation methods. The focus is on Q-learning systems, which maintain estimates of utilities for all state-action pairs and use these estimates to select actions. The use of hybrid direct/indirect methods is briefly discussed.
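As an illustration of the Q-learning idea summarized above, here is a minimal tabular sketch. It is not the paper's implementation: it uses a lookup table instead of a neural network, and the 5-state chain problem, the names `step` and `q_learning`, and the parameters `alpha`, `gamma`, `epsilon`, and `seed` are all hypothetical choices for this example. It shows one utility estimate per state-action pair, action selection from those estimates (epsilon-greedy here, an assumed exploration rule), and the update `q += alpha * (target - q)`. That update is a stochastic-approximation step toward a dynamic-programming (Bellman) target built from a sampled transition rather than a model.

```python
import random

# Hypothetical toy problem (not from the paper): a 5-state chain.
N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # One utility estimate per state-action pair, initialised to zero.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

    def choose(state):
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore;
        # ties between equally valued actions are broken at random.
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        best = max(q[state])
        return rng.choice([a for a in ACTIONS if q[state][a] == best])

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap episode length
            action = choose(state)
            nxt, reward, done = step(state, action)
            # Sample-based Bellman backup: the target uses the observed
            # transition instead of a model of the transition probabilities.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


if __name__ == "__main__":
    q = q_learning()
    policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
    print("greedy policy (1 = right):", policy)
    print("Q(s, right):", [round(q[s][1], 3) for s in range(GOAL)])
```

With `gamma = 0.9`, the learned values of moving right should approach 0.9 raised to the number of remaining steps after the first (1, 0.9, 0.81, 0.729 from state 3 back to state 0). The constant step size is a simplification that suffices for this deterministic toy problem; stochastic problems generally call for decaying step sizes.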
