Reinforcement Learning is Direct Adaptive Optimal Control

Control problems can be divided into two classes: (1) regulation and tracking problems, in which the objective is to follow a reference trajectory, and (2) optimal control problems, in which the objective is to extremize a functional of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory. Adaptive methods for problems of the first kind are well known and include self-tuning regulators and model-reference methods, whereas adaptive methods for optimal control problems have received relatively little attention. Moreover, the adaptive optimal control methods that have been studied are almost all indirect methods, in which the controls are recomputed at each step from an estimated model of the system. This computation is inherently complex, which makes adaptive methods that estimate the optimal controls directly more attractive. Here we present reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems.
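
To make the direct/indirect distinction concrete, the following is a minimal illustrative sketch, not taken from the paper: tabular Q-learning on a small finite Markov environment. The environment (transition kernel P, rewards R), the step sizes, and all parameter values below are assumptions chosen for illustration. The point it demonstrates is that a direct method estimates the optimal action values straight from observed transitions, with a constant-time update per step and no identified system model.

```python
import numpy as np

n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

# Hypothetical unknown Markov environment (assumed for this sketch):
# P[s, a] is a distribution over next states, R[s, a] an expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))   # direct estimate of optimal action values
alpha, gamma, epsilon = 0.1, 0.95, 0.1

s = 0
for t in range(100_000):
    # epsilon-greedy exploration around the current value estimates
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # One-step update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    # No model of P or R is estimated; the controls are learned directly.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print("greedy policy:", np.argmax(Q, axis=1))
```

By contrast, an indirect (certainty-equivalence) controller would maintain running estimates of P and R and re-solve a dynamic-programming problem from those estimates at each step; the direct update above replaces that per-step computation with a single stochastic-approximation step.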
