A model-free robust policy iteration algorithm for optimal control of nonlinear systems

An online model-free solution is developed for the infinite-horizon optimal control problem for continuous-time nonlinear systems. A novel actor-critic-identifier (ACI) structure is used to implement the policy iteration algorithm. It uses two neural network (NN) structures: a robust dynamic neural network (DNN) that asymptotically identifies the uncertain system with additive disturbances, and a critic NN that approximates the value function. The weight update laws for the critic NN are generated by a gradient-descent method based on a modified temporal difference error, which is independent of the system dynamics. The optimal control law (the actor) is computed from the critic NN and the identifier DNN. Uniformly ultimately bounded (UUB) stability of the closed-loop system is guaranteed. The actor, critic, and identifier structures are implemented in real time, continuously and simultaneously.
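The critic update and actor computation described above can be illustrated with a minimal sketch. It assumes a control-affine system, a quadratic cost x'Qx + u'Ru, a linear-in-parameters critic V(x) ≈ W'phi(x), and an identifier that supplies a state-derivative estimate `x_dot_hat` and an input-gain estimate `g_hat`. The abstract specifies none of these details: the basis `phi`, the normalized gradient step, the Euler discretization, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def phi(x):
    # Hypothetical quadratic basis for a 2-state system (assumption).
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def grad_phi(x):
    # Jacobian of phi with respect to x, shape (3, 2).
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2.0 * x[1]]])

def actor(W, x, g_hat, R):
    # Control law from the critic weights and the identifier's
    # input-gain estimate: u = -1/2 R^{-1} g_hat' grad_phi(x)' W.
    return -0.5 * np.linalg.solve(R, g_hat.T @ grad_phi(x).T @ W)

def td_error(W, x, x_dot_hat, u, Q, R):
    # Continuous-time temporal difference (Bellman) error. It uses the
    # identifier's derivative estimate x_dot_hat rather than the true
    # system dynamics.
    omega = grad_phi(x) @ x_dot_hat
    cost = x @ Q @ x + u @ R @ u
    return W @ omega + cost, omega

def critic_step(W, x, x_dot_hat, u, Q, R, eta=1.0, dt=0.1):
    # One Euler step of normalized gradient descent on 0.5 * delta**2.
    delta, omega = td_error(W, x, x_dot_hat, u, Q, R)
    W_dot = -eta * delta * omega / (1.0 + omega @ omega)
    return W + dt * W_dot, delta

# Illustrative values only (not taken from the paper).
x = np.array([1.0, -0.5])
x_dot_hat = np.array([0.2, 0.3])   # would come from the identifier DNN
g_hat = np.array([[0.0], [1.0]])   # would come from the identifier DNN
Q, R = np.eye(2), np.eye(1)
W = np.array([1.0, 0.0, 1.0])

u = actor(W, x, g_hat, R)
W_new, delta = critic_step(W, x, x_dot_hat, u, Q, R)
delta_new, _ = td_error(W_new, x, x_dot_hat, u, Q, R)
```

With the state, derivative estimate, and control held fixed, a single step shrinks the magnitude of the temporal difference error. In an online setting all three would change continuously while the identifier, critic, and actor run simultaneously.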
