Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using the data sets generated by behavior policies. First, we prove equivalence between the proposed off-policy Q-learning algorithm and an offline PI algorithm by selecting specific initially admissible polices that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily met by injecting appropriate probing noises into behavior policies. The generated data sets can be repeatedly used during the learning process, which is computationally effective. The simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.

[1]  Zhong-Ping Jiang,et al.  Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics , 2012, Autom..

[2]  Derong Liu,et al.  Error Bound Analysis of $Q$ -Function for Discounted Optimal Control Problems With Policy Iteration , 2017, IEEE Transactions on Systems, Man, and Cybernetics: Systems.

[3]  Huai-Ning Wu,et al.  Policy Gradient Adaptive Dynamic Programming for Data-Based Optimal Control , 2017, IEEE Transactions on Cybernetics.

[4]  Ruey-Wen Liu,et al.  Construction of Suboptimal Control Sequences , 1967 .

[5]  Frank L. Lewis,et al.  Off-Policy Q-Learning: Set-Point Design for Optimizing Dual-Rate Rougher Flotation Operational Processes , 2018, IEEE Transactions on Industrial Electronics.

[6]  Frank L. Lewis,et al.  $ {H}_{ {\infty }}$ Tracking Control of Completely Unknown Continuous-Time Systems via Off-Policy Reinforcement Learning , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[7]  Peter Dayan,et al.  Technical Note: Q-Learning , 2004, Machine Learning.

[8]  Derong Liu,et al.  A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems , 2015, Science China Information Sciences.

[9]  Frank L. Lewis,et al.  Integral Reinforcement Learning for online computation of feedback Nash strategies of nonzero-sum differential games , 2010, 49th IEEE Conference on Decision and Control (CDC).

[10]  Alessandro Astolfi,et al.  Constructive $\epsilon$-Nash Equilibria for Nonzero-Sum Differential Games , 2015, IEEE Transactions on Automatic Control.

[11]  Paul J. Werbos,et al.  Approximate dynamic programming for real-time control and neural modeling , 1992 .

[12]  Kyriakos G. Vamvoudakis,et al.  Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach , 2017, Syst. Control. Lett..

[13]  Dongbing Gu,et al.  Construction of Barrier in a Fishing Game With Point Capture , 2017, IEEE Transactions on Cybernetics.

[14]  Jacob Engwerda,et al.  LQ Dynamic Optimization and Differential Games , 2005 .

[15]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[16]  Marcus Johnson,et al.  Approximate $N$ -Player Nonzero-Sum Game Solution for an Uncertain Continuous Nonlinear System , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[17]  Frank L. Lewis,et al.  Off-Policy Integral Reinforcement Learning Method to Solve Nonlinear Continuous-Time Multiplayer Nonzero-Sum Games , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[18]  Qichao Zhang,et al.  Experience Replay for Optimal Control of Nonzero-Sum Game Systems With Unknown Dynamics , 2016, IEEE Transactions on Cybernetics.

[19]  Tingwen Huang,et al.  Off-Policy Reinforcement Learning for $ H_\infty $ Control Design , 2013, IEEE Transactions on Cybernetics.

[20]  Frank L. Lewis,et al.  Multi-agent differential graphical games , 2011, Proceedings of the 30th Chinese Control Conference.

[21]  Tzyh Jong Tarn,et al.  Hybrid MDP based integrated hierarchical Q-learning , 2011, Science China Information Sciences.

[22]  Hisham Abou-Kandil,et al.  On global existence of solutions to coupled matrix Riccati equations in closed-loop Nash games , 1996, IEEE Trans. Autom. Control..

[23]  Huaguang Zhang,et al.  Near-Optimal Control for Nonzero-Sum Differential Games of Continuous-Time Nonlinear Systems Using Single-Network ADP , 2013, IEEE Transactions on Cybernetics.

[24]  Frank L. Lewis,et al.  Continuous-Time Q-Learning for Infinite-Horizon Discounted Cost Linear Quadratic Regulator Problems , 2015, IEEE Transactions on Cybernetics.

[25]  Qian Liu,et al.  Towards green for relay in InterPlaNetary Internet based on differential game model , 2014, Science China Information Sciences.

[26]  Frank L. Lewis,et al.  Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations , 2011, Autom..

[27]  R. Godson Elements of intelligence , 1979 .

[28]  Chaomin Luo,et al.  Discrete-Time Nonzero-Sum Games for Multiplayer Using Policy-Iteration-Based Adaptive Dynamic Programming Algorithms , 2017, IEEE Transactions on Cybernetics.

[29]  Richard B. Vinter,et al.  Differential Games Controllers That Confine a System to a Safe Region in the State Space, With Applications to Surge Tank Control , 2012, IEEE Transactions on Automatic Control.

[30]  Chaoxu Mu,et al.  Developing nonlinear adaptive optimal regulators through an improved neural learning mechanism , 2016, Science China Information Sciences.

[31]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[32]  Frank L. Lewis,et al.  Adaptive dynamic programming for online solution of a zero-sum differential game , 2011 .

[33]  Frank L. Lewis,et al.  Neurodynamic Programming and Zero-Sum Games for Constrained Control Systems , 2008, IEEE Transactions on Neural Networks.

[34]  Corrado Possieri,et al.  An algebraic geometry approach for the computation of all linear feedback Nash equilibria in LQ differential games , 2015, 2015 54th IEEE Conference on Decision and Control (CDC).

[35]  Kyriakos G. Vamvoudakis,et al.  Non-zero sum Nash Q-learning for unknown deterministic continuous-time linear systems , 2015, Autom..

[36]  Yao Zhao,et al.  Cross-Modal Retrieval With CNN Visual Features: A New Baseline , 2017, IEEE Transactions on Cybernetics.

[37]  Randal W. Bea Successive Galerkin approximation algorithms for nonlinear optimal and robust control , 1998 .

[38]  Frank L. Lewis,et al.  H∞ control of linear discrete-time systems: Off-policy reinforcement learning , 2017, Autom..

[39]  Frank L. Lewis,et al.  Game Theory-Based Control System Algorithms with Real-Time Reinforcement Learning: How to Solve Multiplayer Games Online , 2017, IEEE Control Systems.

[40]  Zongli Lin,et al.  Output feedback Q-learning for discrete-time linear zero-sum games with application to the H-infinity control , 2018, Autom..

[41]  Frank L. Lewis,et al.  Integral Reinforcement Learning for online computation of feedback Nash strategies of nonzero-sum differential games , 2010, CDC 2010.

[42]  Tingwen Huang,et al.  Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design , 2014, Autom..

[43]  L LewisFrank,et al.  Multi-player non-zero-sum games , 2011 .

[44]  D. Kleinman On an iterative technique for Riccati equation computations , 1968 .

[45]  Andrew G. Barto,et al.  Adaptive linear quadratic control using policy iteration , 1994, Proceedings of 1994 American Control Conference - ACC '94.

[46]  T.-Y. Li,et al.  Lyapunov Iterations for Solving Coupled Algebraic Riccati Equations of Nash Differential Games and Algebraic Riccati Equations of Zero-Sum Games , 1995 .

[47]  Huaguang Zhang,et al.  An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games , 2011, Autom..

[48]  Derong Liu,et al.  Online Synchronous Approximate Optimal Learning Algorithm for Multi-Player Non-Zero-Sum Games With Unknown Dynamics , 2014, IEEE Transactions on Systems, Man, and Cybernetics: Systems.

[49]  G. Jank,et al.  On Global Existence of Solutions to Coupled Matrix Riccati Equations in Closed Loop , 1996 .

[50]  João P. Hespanha,et al.  Cooperative Q-Learning for Rejection of Persistent Adversarial Inputs in Networked Linear Quadratic Systems , 2018, IEEE Transactions on Automatic Control.

[51]  Kenji Doya,et al.  Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.

[52]  Frank L. Lewis,et al.  Adaptive optimal control for continuous-time linear systems based on policy iteration , 2009, Autom..

[53]  Frank L. Lewis,et al.  Discrete-Time Deterministic $Q$ -Learning: A Novel Convergence Analysis , 2017, IEEE Transactions on Cybernetics.

[54]  Dongbin Zhao,et al.  Iterative Adaptive Dynamic Programming for Solving Unknown Nonlinear Zero-Sum Game Based on Online Data , 2017, IEEE Transactions on Neural Networks and Learning Systems.