Online Model Learning Algorithms for Actor-Critic Control

Classical control theory requires a model of the system to be derived before any control design can take place. This can be a hard, time-consuming process if the system is complex, and modelling errors can never be fully avoided. An alternative approach is to let the system learn a controller by itself, either offline or while it is in operation. Reinforcement learning (RL) is such a framework, in which an agent (or controller) optimises its behaviour by interacting with its environment. For continuous state and action spaces, the use of function approximators is a necessity, and a commonly used class of RL algorithms for these continuous spaces is the actor-critic algorithm, in which two independent function approximators take the roles of the policy (the actor) and the value function (the critic). A main challenge in RL is to use the information gathered during the interaction as efficiently as possible, so that an optimal policy may be reached in a short amount of time. Most RL algorithms, at each time step, measure the state, choose an action corresponding to this state, measure the next state and the corresponding reward, and update a value function (and possibly a separate policy). As such, the only source of information used for learning at each time step is the last transition sample. This thesis proposes novel actor-critic methods that aim to shorten the learning time by using every transition sample collected during learning to learn a model of the system online. It also explores the possibility of speeding up learning by providing the agent with explicit knowledge of the reward function.
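To make the idea concrete, below is a minimal sketch of how a model-learning actor-critic loop could look. It is an illustration only, not the thesis's actual algorithms: it assumes a hypothetical scalar toy system, polynomial features, a Gaussian exploration policy, a batch least-squares model fitted from all collected transitions, a Dyna-style "imagined" update using that model, and a known quadratic reward.

```python
# Minimal sketch (assumptions as stated above; not the thesis's algorithms):
# a temporal-difference actor-critic with linear-in-features actor and critic,
# extended with an online-learned linear model used for one extra simulated
# update per time step.
import numpy as np

rng = np.random.default_rng(0)

def step(x, u):
    # Toy 1-D plant: x' = 0.9*x + 0.1*u + noise; reward penalises state and effort.
    x_next = 0.9 * x + 0.1 * u + 0.01 * rng.normal()
    reward = -x_next**2 - 0.01 * u**2
    return x_next, reward

def features(x):
    # Simple polynomial features for both actor and critic.
    return np.array([1.0, x, x**2])

gamma, alpha_c, alpha_a, sigma = 0.97, 0.1, 0.01, 0.3
theta = np.zeros(3)   # critic weights: V(x) ~ theta . features(x)
w = np.zeros(3)       # actor weights: mean action = w . features(x)
A, B = 0.0, 0.0       # learned model x' ~ A*x + B*u (recursive LS would be used in practice)
samples = []

x = 1.0
for t in range(2000):
    u = w @ features(x) + sigma * rng.normal()   # Gaussian exploration around the actor's mean

    x_next, r = step(x, u)

    # Actor-critic update from the real transition (standard TD learning).
    delta = r + gamma * theta @ features(x_next) - theta @ features(x)
    theta += alpha_c * delta * features(x)
    w += alpha_a * delta * (u - w @ features(x)) * features(x)  # policy-gradient-like actor step

    # Online model learning: fit x' ~ A*x + B*u from every collected sample.
    samples.append((x, u, x_next))
    X = np.array([[s[0], s[1]] for s in samples])
    y = np.array([s[2] for s in samples])
    (A, B), *_ = np.linalg.lstsq(X, y, rcond=None)

    # One "imagined" update with the learned model (assumes the reward function is known).
    x_sim = rng.uniform(-1, 1)
    u_sim = w @ features(x_sim)
    x_sim_next = A * x_sim + B * u_sim
    r_sim = -x_sim_next**2 - 0.01 * u_sim**2
    delta_sim = r_sim + gamma * theta @ features(x_sim_next) - theta @ features(x_sim)
    theta += alpha_c * delta_sim * features(x_sim)

    x = x_next
```

The point of the sketch is the extra update drawn from the learned model: the real transition is still used for the usual actor-critic step, but the same sample also refines the model, which in turn generates additional learning updates at no cost in real interaction time.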
