Off-policy based adaptive dynamic programming method for nonzero-sum games on discrete-time system

Abstract: This paper introduces a novel model-free, off-policy reinforcement learning method for solving nonzero-sum games of discrete-time linear systems. Unlike the traditional policy iteration (PI) method, which requires knowledge of the system dynamics, the proposed method can be trained directly on state data. Moreover, the traditional PI method is proved to be affected by probing noise, whereas the analysis of the proposed method explicitly accounts for probing noise and proves that it has no influence on convergence. The optimal Nash equilibrium solution is derived. It is also proved that the proposed algorithm can be applied both online and offline. A simulation of a nonzero-sum game control problem on a discrete-time F-16 aircraft system is presented, and the results verify the effectiveness of the proposed algorithm.
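
For orientation, the problem class named in the abstract is commonly written as follows. This is only a sketch of the standard textbook formulation of an N-player nonzero-sum linear-quadratic game in discrete time. The symbols $A$, $B_j$, $Q_i$, $R_{ij}$ and the quadratic cost are assumptions made for illustration and are not taken from the paper, whose notation may differ.

```latex
% Assumed setup: N players acting on one discrete-time linear system
x_{k+1} = A x_k + \sum_{j=1}^{N} B_j u_{j,k}

% Assumed quadratic cost for player i (Q_i \succeq 0, R_{ii} \succ 0)
J_i(x_0, u_1, \dots, u_N)
  = \sum_{k=0}^{\infty} \Big( x_k^{\top} Q_i x_k
    + \sum_{j=1}^{N} u_{j,k}^{\top} R_{ij} u_{j,k} \Big),
  \qquad i = 1, \dots, N

% Nash equilibrium: no player can lower its own cost by deviating alone
J_i(u_1^{*}, \dots, u_i^{*}, \dots, u_N^{*})
  \le J_i(u_1^{*}, \dots, u_i, \dots, u_N^{*})
  \quad \text{for all admissible } u_i,\; i = 1, \dots, N
```

Under this assumed setup, "model-free" means that a learning method finds the equilibrium policies without using $A$ or $B_j$, relying only on measured state and input data.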
