Nonlinear two-player zero-sum game approximate solution using a Policy Iteration algorithm

An approximate online solution is developed for a two-player zero-sum game subject to continuous-time uncertain nonlinear dynamics and an infinite-horizon quadratic cost. A novel actor-critic-identifier (ACI) structure is used to implement the Policy Iteration (PI) algorithm, wherein a robust dynamic neural network (DNN) asymptotically identifies the uncertain system and a critic NN approximates the value function. The weight update laws for the critic NN are generated by a gradient-descent method based on a modified temporal-difference error that is independent of the system dynamics. The method yields approximations of the optimal value function and of the saddle-point feedback control policies; the policies are computed from the critic NN and the identifier DNN and guarantee uniformly ultimately bounded (UUB) stability of the closed-loop system. The actor, critic, and identifier structures are implemented in real time, continuously and simultaneously.
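As a rough illustration of the pipeline the abstract describes, the Python sketch below wires together a critic NN, the saddle-point policies it induces, and a placeholder identifier for dynamics of the assumed affine form x_dot = f(x) + g(x)u + k(x)w with cost integrand x'Qx + u'Ru - gamma^2 w'w. This is a minimal sketch under those assumptions, not the paper's exact laws: the quadratic basis `phi`, the stand-in drift estimate `f_hat` (a placeholder for the DNN identifier's output), the gains, and the normalized gradient-descent critic update are all illustrative choices.

```python
import numpy as np

# --- illustrative problem data (2-state example, all values assumed) ---
Q = np.eye(2)
R = np.eye(1)
gamma = 1.0            # attenuation level of the zero-sum game
alpha_c = 1.0          # critic learning rate
dt = 1e-3              # Euler integration step

def g(x):              # control input gain (assumed known)
    return np.array([[0.0], [1.0]])

def k(x):              # disturbance input gain (assumed known)
    return np.array([[0.0], [0.5]])

def phi(x):            # critic basis: quadratic features
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def dphi(x):           # Jacobian of phi w.r.t. x, shape (3, 2)
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    2*x[1]]])

def f_hat(x):          # placeholder for the DNN identifier's drift estimate
    return np.array([x[1], -x[0] - x[1]])

W = 0.1 * np.ones(3)   # critic NN weights, V(x) ~ W' phi(x)
x = np.array([1.0, -1.0])

for _ in range(10000):
    grad_V = dphi(x).T @ W                        # approx. dV/dx
    # saddle-point policies computed from the critic estimate
    u = -0.5 * np.linalg.solve(R, g(x).T @ grad_V)
    w = (1.0 / (2 * gamma**2)) * k(x).T @ grad_V
    # state derivative built from the identifier, not the true drift f(x)
    x_dot = f_hat(x) + (g(x) @ u + k(x) @ w).ravel()
    # modified temporal-difference (Bellman) residual
    r = x @ Q @ x + float(u @ R @ u) - gamma**2 * float(w @ w)
    omega = dphi(x) @ x_dot
    delta = W @ omega + r
    # normalized gradient-descent update of the critic weights
    W -= alpha_c * dt * delta * omega / (1.0 + omega @ omega)**2
    x += dt * x_dot
```

Because the residual is formed from the identifier output rather than the true drift, the critic update mirrors the abstract's claim of being independent of the (uncertain) system dynamics; the actor policies, critic update, and state integration all run inside the same loop, i.e., continuously and simultaneously.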
