Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem

In this paper we discuss an online algorithm based on policy iteration for learning the continuous-time (CT) optimal control solution with infinite horizon cost for nonlinear systems with known dynamics. We present an online adaptive algorithm, implemented as an actor-critic structure, that adapts both actor and critic neural networks simultaneously and in continuous time; we call this 'synchronous' policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning laws are given for both the critic and actor networks, with extra terms in the actor tuning law required to guarantee closed-loop dynamical stability. Convergence to the optimal controller is proven, and stability of the closed-loop system is guaranteed. Simulation examples show the effectiveness of the new algorithm.
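To make the synchronous actor-critic idea concrete, the following is a minimal illustrative sketch, not the paper's exact tuning laws. It assumes a scalar linear system dx/dt = a*x + b*u with quadratic cost, a single basis function phi(x) = x^2, a simplified actor update without the paper's extra stabilizing terms, and learning gains and a state-reset schedule (standing in for a persistence-of-excitation probing signal) chosen purely for this toy example. The known dynamics let us evaluate the continuous-time Bellman residual exactly, and the critic weight can be checked against the scalar Riccati solution.

```python
import math

# Toy synchronous actor-critic for dx/dt = a*x + b*u with cost
# integral of (q*x^2 + r*u^2) dt. The critic approximates V(x) = wc*x^2
# and the actor implements u = -(b/r)*x*wa. All numerical choices below
# are assumptions made for this sketch.
a, b, q, r = -1.0, 1.0, 1.0, 1.0
alpha_c, alpha_a = 30.0, 2.0       # critic / actor learning rates (assumed)
dt, steps = 1e-3, 150_000

# Known optimal value weight from the scalar Riccati equation
#   (b^2/r)*p^2 - 2*a*p - q = 0  =>  p = sqrt(2) - 1 for these numbers.
p_star = math.sqrt(2.0) - 1.0

wc, wa = 0.0, 0.0                  # critic and actor weights
resets = [1.0, -1.5, 0.8, -0.6]    # periodic state resets provide excitation
x = resets[0]

for k in range(steps):
    if k % 2000 == 0:              # re-excite the state every 2 seconds
        x = resets[(k // 2000) % len(resets)]
    u = -(b / r) * x * wa          # actor: u = -(1/(2r)) * b * dV/dx, V = wa*x^2
    xdot = a * x + b * u           # dynamics are known, so xdot is exact
    sigma = 2.0 * x * xdot         # time derivative of the basis phi(x) = x^2
    e = wc * sigma + q * x**2 + r * u**2   # continuous-time Bellman residual
    norm = (1.0 + sigma * sigma) ** 2
    wc -= dt * alpha_c * sigma * e / norm  # normalized-gradient critic update
    wa -= dt * alpha_a * (wa - wc)         # simplified actor update (the paper
                                           # adds stabilizing terms omitted here)
    x += dt * xdot                 # Euler step of the plant

print(round(wc, 3), round(wa, 3), round(p_star, 3))
```

Both weights settle near the Riccati value p = sqrt(2) - 1 ~= 0.414; without the periodic resets the state decays to zero, the regressor sigma loses excitation, and the critic stops learning, which is the role the persistence of excitation condition plays in the analysis.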
