Optimal and Autonomous Control Using Reinforcement Learning: A Survey

This paper reviews the current state of the art in reinforcement learning (RL)-based feedback control solutions to optimal regulation and tracking problems for single-agent and multiagent systems. Existing RL solutions to both optimal <inline-formula> <tex-math notation="LaTeX">$\mathcal {H}_{2}$ </tex-math></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$\mathcal {H}_\infty $ </tex-math></inline-formula> control problems, as well as to graphical games, are reviewed. RL methods learn the solutions to optimal control and game problems online, using data measured along the system trajectories. We discuss Q-learning and the integral RL algorithm as core algorithms for discrete-time (DT) and continuous-time (CT) systems, respectively. We also discuss the newer direction of off-policy RL for both CT and DT systems. Finally, we review several applications.
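To make the DT case concrete, the following is a minimal sketch (not taken from the survey) of model-free Q-learning by policy iteration for a scalar discrete-time LQR problem. The quadratic Q-function of the current policy is identified by least squares from measured data along the trajectory, and the gain is then improved greedily; the plant parameters `a` and `b` are assumed values used only to simulate data, never inside the learner.

```python
import numpy as np

# Assumed scalar plant x_{k+1} = a x_k + b u_k with stage cost q x^2 + r u^2.
# The learner never uses (a, b); it only sees (x_k, u_k, cost_k, x_{k+1}).
a, b = 0.9, 0.5
q, r = 1.0, 1.0

rng = np.random.default_rng(0)
K = 0.0  # initial stabilizing gain, policy u = -K x

for it in range(20):
    # Policy evaluation: Q(x, u) = h1 x^2 + 2 h2 x u + h3 u^2 satisfies the
    # Bellman equation Q(x_k, u_k) = cost_k + Q(x_{k+1}, -K x_{k+1}).
    Phi, y = [], []
    x = 1.0
    for k in range(50):
        u = -K * x + 0.1 * rng.standard_normal()   # exploration noise
        xn = a * x + b * u
        un = -K * xn                               # next action under the policy
        Phi.append([x*x - xn*xn, 2*(x*u - xn*un), u*u - un*un])
        y.append(q * x*x + r * u*u)                # measured stage cost
        x = xn if abs(xn) > 1e-3 else 1.0          # reset to keep data informative
    h1, h2, h3 = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)[0]
    K = h2 / h3  # policy improvement: argmin_u Q(x, u) = -(h2/h3) x

# For comparison only: the model-based Riccati solution.
p = q
for _ in range(1000):
    p = q + a*a*p - (a*b*p)**2 / (r + b*b*p)
K_opt = a*b*p / (r + b*b*p)
print(K, K_opt)
```

Because the system here is deterministic, the least-squares step recovers the policy's Q-function exactly and the learned gain matches the Riccati gain after a few iterations, without ever identifying `a` or `b` explicitly. Integral RL plays the analogous role in the CT setting, where a direct Q-learning formulation is less natural.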
