Online Synchronous Policy Iteration Method for Optimal Control

In this chapter, we discuss an online algorithm based on policy iteration (PI) for learning the continuous-time (CT) optimal control solution for nonlinear systems with an infinite-horizon cost. We present an online adaptive algorithm implemented as an actor-critic structure that involves simultaneous continuous-time adaptation of both actor and critic neural networks; we call this "synchronous" PI. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for both the critic and actor networks, with extra terms in the actor tuning law required to guarantee closed-loop dynamical stability. Convergence to the optimal controller is proven, and closed-loop stability is guaranteed. Simulation examples show the effectiveness of the new algorithm.
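To make the actor-critic structure concrete, the sketch below simulates synchronous tuning of critic and actor weights for a simple input-affine system. It is a minimal sketch under stated assumptions, not the chapter's exact equations: the dynamics, the polynomial value-function basis, the gains, the damping term in the actor law, and the decaying probing signal (used to provide persistence of excitation) are all illustrative choices.

```python
# Minimal sketch of synchronous actor-critic policy iteration for
# xdot = f(x) + g(x) u with cost integrand x'Qx + u'Ru.
# The system, basis phi(x) = [x1^2, x1*x2, x2^2], gains, and probing
# noise are assumptions for illustration only.
import numpy as np

R = np.array([[1.0]])                      # control weight
Q = np.eye(2)                              # state weight, so Q(x) = x'Qx

def f(x):                                  # drift dynamics (assumed example)
    return np.array([-x[0] + x[1],
                     -0.5 * x[0] - 0.5 * x[1] * (1.0 - (np.cos(2 * x[0]) + 2) ** 2)])

def g(x):                                  # input gain (assumed example)
    return np.array([[0.0], [np.cos(2 * x[0]) + 2.0]])

def phi_grad(x):                           # gradient of the basis phi(x) = [x1^2, x1*x2, x2^2]
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2 * x[1]]])

a1, a2 = 50.0, 1.0                         # critic / actor adaptation gains (assumed)
F2 = np.eye(3)                             # actor damping matrix (assumed form)
W1 = np.ones(3)                            # critic weights, V(x) ~ W1' phi(x)
W2 = np.ones(3)                            # actor weights
x = np.array([1.0, 1.0])
dt = 1e-3

for k in range(int(60 / dt)):
    t = k * dt
    gradphi = phi_grad(x)
    # actor: u = -1/2 R^{-1} g(x)' gradphi(x)' W2
    u = -0.5 * np.linalg.solve(R, g(x).T @ gradphi.T @ W2)
    u_pe = u + 0.1 * np.exp(-0.05 * t) * np.sin(5 * t)     # decaying probing signal
    xdot = f(x) + (g(x) @ u_pe.reshape(-1, 1)).ravel()
    # critic: normalized gradient descent on the Bellman (HJB) residual
    sigma = gradphi @ xdot
    e = W1 @ sigma + x @ Q @ x + u_pe @ R @ u_pe           # Bellman error
    ms = 1.0 + sigma @ sigma
    W1 = W1 - dt * a1 * (sigma / ms ** 2) * e
    # actor: pulled toward the critic's implied policy; the F2 damping term
    # stands in for the stability-related extra terms mentioned in the text
    D1 = gradphi @ g(x) @ np.linalg.solve(R, g(x).T) @ gradphi.T
    W2 = W2 - dt * a2 * (F2 @ (W2 - W1) - 0.25 * D1 @ W2 * (sigma @ W1) / ms ** 2)
    x = x + dt * xdot

print("critic weights:", W1)
print("actor  weights:", W2)
```

Both weight vectors are updated at every integration step, which is the sense in which the adaptation is "synchronous": there is no inner loop that waits for the critic to converge before the actor is improved.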
