Discrete-Time Deterministic $Q$-Learning: A Novel Convergence Analysis

In this paper, a novel discrete-time deterministic $Q$-learning algorithm is developed. In each iteration, the iterative $Q$ function is updated over the entire state and control spaces, rather than for a single state and a single control as in traditional $Q$-learning algorithms. A new convergence criterion is established that guarantees the iterative $Q$ function converges to the optimum, and it simplifies the learning-rate convergence criterion required by traditional $Q$-learning algorithms. In the convergence analysis, upper and lower bounds of the iterative $Q$ function are analyzed to obtain the convergence criterion, instead of analyzing the iterative $Q$ function itself. For convenience of analysis, the convergence properties of the deterministic $Q$-learning algorithm are first developed for the undiscounted case. Then, taking the discount factor into account, the convergence criterion for the discounted case is established. To facilitate implementation, neural networks are used to approximate the iterative $Q$ function and to compute the iterative control law, respectively. Finally, simulation results and comparisons are given to illustrate the performance of the developed algorithm.
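To make the full-space update concrete, the sketch below shows one plausible form of such an iteration on a discretized problem. It is a minimal sketch, not the paper's exact algorithm: the discretization, the dynamics `F`, the utility `U`, and the update rule $Q_{i+1}(x,u) = U(x,u) + \gamma \min_{u'} Q_i(F(x,u), u')$ are all assumptions introduced here for illustration.

```python
import numpy as np

def deterministic_q_iteration(states, controls, F, U, gamma=0.95,
                              num_iterations=100, tol=1e-6):
    """Hypothetical full-space deterministic Q-learning sketch.

    Iterates Q_{i+1}(x,u) = U(x,u) + gamma * min_{u'} Q_i(F(x,u), u')
    simultaneously for ALL discretized states x and controls u, in
    contrast to traditional Q-learning, which updates one visited
    (state, control) pair per step.
    """
    nx, nu = len(states), len(controls)
    Q = np.zeros((nx, nu))  # Q_0 = 0 initialization

    # Map each successor state F(x, u) to its nearest grid point once,
    # so every sweep is a pure array update.
    def nearest_state_index(x):
        return int(np.argmin([np.linalg.norm(x - s) for s in states]))

    next_idx = np.array([[nearest_state_index(F(x, u)) for u in controls]
                         for x in states])
    cost = np.array([[U(x, u) for u in controls] for x in states])

    for _ in range(num_iterations):
        V = Q.min(axis=1)                    # V_i(x) = min_u Q_i(x, u)
        Q_next = cost + gamma * V[next_idx]  # update every (x, u) at once
        if np.max(np.abs(Q_next - Q)) < tol:
            Q = Q_next
            break
        Q = Q_next

    policy = Q.argmin(axis=1)  # greedy iterative control law from final Q
    return Q, policy
```

Because every (state, control) pair is updated in each sweep, no per-pair learning rate (and hence no learning-rate convergence condition) appears in this sketch; in the paper, the tabular $Q$ array and the argmin-based control law are replaced by a critic network and an action network, respectively.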
