Reinforcement learning and optimal adaptive control: An overview and implementation examples

Abstract

This paper provides an overview of the reinforcement learning and optimal adaptive control literature and its application to robotics. Reinforcement learning bridges the gap between traditional optimal control, adaptive control, and bio-inspired learning techniques observed in animals. This work highlights some of the key techniques presented by well-known researchers from the combined areas of reinforcement learning and optimal control theory. Finally, an implementation example of a novel model-free Q-learning-based discrete-time optimal adaptive controller for a humanoid robot arm is presented. The controller uses an adaptive dynamic programming (ADP) reinforcement learning (RL) approach to develop an optimal policy online. The RL joint-space tracking controller was implemented for two links (the shoulder flexion and elbow flexion joints) of the arm of the humanoid Bristol-Elumotion-Robotic-Torso II (BERT II). The constrained case (joint limits) of the RL scheme was tested for a single link (elbow flexion) of the BERT II arm by modifying the cost function to handle the extra nonlinearity introduced by the joint constraints.
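To make the model-free Q-learning ADP idea concrete, the listing below is a minimal sketch of a least-squares Q-learning policy-iteration loop for a generic discrete-time linear plant, in the spirit of Bradtke-style Q-learning for linear quadratic regulation. It is not the BERT II implementation: the plant matrices A and B (used only to simulate transitions, never read by the learner), the cost weights, the exploration noise level, and the iteration counts are all illustrative assumptions.

import numpy as np

# Illustrative second-order discrete-time plant (an assumption standing in
# for identified joint dynamics; used only to generate simulated data).
A = np.array([[0.9, 0.05],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.05]])
Qc = np.eye(2)                 # state-cost weight (illustrative)
Rc = np.array([[0.1]])         # control-cost weight (illustrative)
n, m = 2, 1

K = np.zeros((m, n))           # initial policy gain; must be stabilising

def phi(x, u):
    # Quadratic basis: upper-triangular entries of z z^T with z = [x; u],
    # so that Q(x, u) = z^T H z is linear in the learned parameters.
    z = np.concatenate([x, u])
    return np.array([z[i] * z[j] for i in range(n + m) for j in range(i, n + m)])

rng = np.random.default_rng(0)
for it in range(10):                       # policy-iteration loop
    Phi, y = [], []
    x = rng.standard_normal(n)
    for k in range(200):                   # collect data with exploration noise
        u = -K @ x + 0.1 * rng.standard_normal(m)
        stage_cost = x @ Qc @ x + u @ Rc @ u
        x_next = A @ x + B @ u
        u_next = -K @ x_next               # current policy at the next state
        # Bellman equation Q(x,u) = cost + Q(x',u'), rearranged for regression
        Phi.append(phi(x, u) - phi(x_next, u_next))
        y.append(stage_cost)
        x = x_next
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)

    # Rebuild the symmetric Q-function kernel H from the triangular parameters
    H = np.zeros((n + m, n + m))
    idx = 0
    for i in range(n + m):
        for j in range(i, n + m):
            H[i, j] = H[j, i] = theta[idx] if i == j else theta[idx] / 2.0
            idx += 1

    # Model-free policy improvement: u = -inv(H_uu) H_ux x
    K = np.linalg.solve(H[n:, n:], H[n:, :n])

print("learned gain K:", K)

In the constrained (joint-limit) case described in the abstract, the same loop would in principle apply with the quadratic stage cost replaced by a nonquadratic term that penalises approach to the limits; the Q-function then ceases to be quadratic and the basis phi must be enriched accordingly, an extension only indicated here.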
