Reinforcement learning in continuous state- and action-space

Reinforcement learning in a continuous state-space poses the problem that the values of all state-action pairs cannot be stored in a lookup table, both because of storage limitations and because not all states can be visited sufficiently often to learn the correct values. This can be overcome by using function approximation techniques with generalisation capability, such as artificial neural networks, to represent the value function. With such an approximation the optimal action can still be selected by comparing the values of each possible action; however, when the action-space is also continuous this exhaustive comparison is no longer possible. In this thesis we investigate methods of selecting the optimal action when artificial neural networks are used to approximate the value function, through the application of numerical optimization techniques. Although it has been stated in the literature that gradient-ascent methods can be applied to action selection [47], it is also stated that solving this optimization problem at every time step would be infeasible, and it is therefore claimed that it is necessary to utilise a second artificial neural network to approximate the policy function [21, 55]. The major contributions of this thesis are an investigation of the applicability of action selection by numerical optimization, including gradient-ascent along with other derivative-based and derivative-free numerical optimization methods, and the proposal of two novel algorithms based on two alternative action selection methods: NM-SARSA [40] and NelderMead-SARSA. We empirically compare the proposed methods to state-of-the-art methods from the literature on three continuous state- and action-space control benchmark problems: minimum-time full swing-up of the Acrobot, the Cart-Pole balancing problem, and a double-pole variant. We also present novel results from applying the existing direct policy search method genetic programming to the Acrobot benchmark problem [12, 14].
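
To make the action-selection idea concrete, the sketch below is a minimal illustration (not the implementation used in the thesis): it selects a greedy continuous action by numerically maximising a neural-network approximation of Q(s, a) over the action, here with the derivative-free Nelder-Mead method. The toy network, its dimensions, the action bounds, and the restart count are all illustrative assumptions.

```python
# Minimal sketch: greedy continuous-action selection by numerically maximising
# a neural-network Q(s, a) over the action. All sizes and weights are toy values.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy single-hidden-layer Q-network; input is the concatenated (state, action) vector.
STATE_DIM, ACTION_DIM, HIDDEN = 4, 1, 16
W1 = rng.normal(scale=0.5, size=(HIDDEN, STATE_DIM + ACTION_DIM))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.5, size=HIDDEN)
b2 = 0.0

def q_value(state, action):
    """Approximate action value Q(s, a) from the toy network."""
    x = np.concatenate([state, np.atleast_1d(action)])
    h = np.tanh(W1 @ x + b1)      # hidden layer
    return float(W2 @ h + b2)     # scalar value estimate

def select_action_nelder_mead(state, action_low=-1.0, action_high=1.0, restarts=3):
    """Greedy action via derivative-free maximisation of Q(s, a) over a.

    Nelder-Mead minimises, so we minimise -Q; a few random restarts guard
    against local optima of the non-convex network output.
    """
    best_a, best_q = None, -np.inf
    for _ in range(restarts):
        a0 = rng.uniform(action_low, action_high, size=ACTION_DIM)
        res = minimize(lambda a: -q_value(state, a), a0, method="Nelder-Mead")
        a = np.clip(res.x, action_low, action_high)
        q = q_value(state, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

if __name__ == "__main__":
    s = rng.normal(size=STATE_DIM)   # placeholder state observation
    a, q = select_action_nelder_mead(s)
    print(f"greedy action {a} with Q = {q:.4f}")
```

A derivative-based variant in the spirit of NM-SARSA would replace the Nelder-Mead call with gradient-ascent or Newton steps on the same objective, using the network's gradient with respect to the action.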

[1] Michail G. Lagoudakis, et al. Learning continuous-action control policies, 2009, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[2] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[3] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[4] Jennie Si, et al. Online learning control by association and reinforcement, 2001, IEEE Transactions on Neural Networks.

[5] Larry D. Pyeatt, et al. A comparison between cellular encoding and direct encoding for genetic neural networks, 1996.

[6] Leemon C. Baird, et al. Reinforcement Learning With High-Dimensional, Continuous Actions, 1993.

[7] Jeffrey C. Lagarias, et al. Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions, 1998, SIAM J. Optim.

[8] Simon Haykin, et al. Neural Networks and Learning Machines, 2010.

[9] Jonathan Baxter, et al. Reinforcement Learning From State and Temporal Differences, 1999.

[10] Alex M. Andrew, et al. Reinforcement Learning: An Introduction, 1998.

[11] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.

[12] Dimitris C. Dracopoulos, et al. Genetic Programming for Generalised Helicopter Hovering Control, 2012, EuroGP.

[13] Dimitris C. Dracopoulos, et al. Swing Up and Balance Control of the Acrobot Solved by Genetic Programming, 2012, SGAI Conf.

[14] Mark W. Spong, et al. The swing up control problem for the Acrobot, 1995.

[15] Dimitris C. Dracopoulos, et al. Application of Newton's Method to action selection in continuous state- and action-space reinforcement learning, 2014, ESANN.

[16] A. P. Wieland, et al. Evolving neural network controllers for unstable systems, 1991, IJCNN-91-Seattle International Joint Conference on Neural Networks.

[17] Ben Tse, et al. Autonomous Inverted Helicopter Flight via Reinforcement Learning, 2004, ISER.

[18] Christian Igel, et al. Evolution Strategies for Direct Policy Search, 2008, PPSN.

[19] A. E. Eiben, et al. Introduction to Evolutionary Computing, 2003, Natural Computing Series.

[20] Risto Miikkulainen, et al. Solving Non-Markovian Control Tasks with Neuro-Evolution, 1999, IJCAI.

[21] Hiroshi Kinjo, et al. On the Continuous Control of the Acrobot via Computational Intelligence, 2009, IEA/AIE.

[22] Verena Heidrich-Meisner, et al. Neuroevolution strategies for episodic reinforcement learning, 2009, J. Algorithms.

[23] Stephen P. Boyd, et al. Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.

[24] Mark W. Spong, et al. Swing up control of the Acrobot, 1994, Proceedings of the 1994 IEEE International Conference on Robotics and Automation.

[25] Xin Xu, et al. Kernel-Based Least Squares Policy Iteration for Reinforcement Learning, 2007, IEEE Transactions on Neural Networks.

[26] Javier de Lope, et al. The kNN-TD Reinforcement Learning Algorithm, 2009.

[27] Simon X. Yang, et al. Comprehensive Unified Control Strategy for Underactuated Two-Link Manipulators, 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[28] John R. Koza, et al. Genetic programming - on the programming of computers by means of natural selection, 1993, Complex adaptive systems.

[29] Sean Luke, et al. A Comparison of Bloat Control Methods for Genetic Programming, 2006, Evolutionary Computation.

[30] Ashwin Ram, et al. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces, 1997, Adapt. Behav.

[31] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[32] Bart De Schutter, et al. Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.

[33] Charles W. Anderson, et al. Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning, 1997, Proceedings of International Conference on Neural Networks (ICNN'97).

[34] Gary Boone, et al. Minimum-time control of the Acrobot, 1997, Proceedings of International Conference on Robotics and Automation.

[35] Eiho Uezato, et al. Swing-up control of a 3-DOF acrobot using an evolutionary approach, 2009, Artificial Life and Robotics.

[36] Rémi Coulom, et al. High-accuracy value-function approximation with neural networks applied to the acrobot, 2004, ESANN.

[37] Peter Stone, et al. Empowerment for continuous agent-environment systems, 2011, Adapt. Behav.

[38] James S. Albus, et al. New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC), 1975.

[39] John A. Nelder, et al. A Simplex Method for Function Minimization, 1965, Comput. J.

[40] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.

[41] Robert Babuska, et al. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients, 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[42] Martin A. Riedmiller, et al. Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark, 2007, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[43] Hado van Hasselt, et al. Reinforcement Learning in Continuous State and Action Spaces, 2012, Reinforcement Learning.

[44] Gary Boone, et al. Efficient reinforcement learning: model-based Acrobot control, 1997, Proceedings of International Conference on Robotics and Automation.

[45] Junichiro Yoshimoto, et al. Acrobot control by learning the switching of multiple controllers, 2005, Artificial Life and Robotics.

[46] Dimitris C. Dracopoulos, et al. Genetic programming as a solver to challenging reinforcement learning problems, 2013.

[47] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[48] Hiroshi Kinjo, et al. A switch controller design for the acrobot using neural network and genetic algorithm, 2008, 2008 10th International Conference on Control, Automation, Robotics and Vision.

[49] Laurene V. Fausett, et al. Fundamentals of Neural Networks, 1994.

[50] D. K. Smith, et al. Numerical Optimization, 2001, J. Oper. Res. Soc.

[51] Dominique Bonvin, et al. Quotient method for controlling the acrobot, 2009, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference.

[52] R. Bellman. Dynamic programming, 1957, Science.

[53] B. M. Wilamowski, et al. Neural network architectures and learning algorithms, 2009, IEEE Industrial Electronics Magazine.

[54] Warren B. Powell, et al. Approximate Dynamic Programming - Solving the Curses of Dimensionality, 2007.

[55] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[56] M. A. Wiering, et al. Reinforcement Learning in Continuous Action Spaces, 2007, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[57] Matthijs T. J. Spaan, et al. Partially Observable Markov Decision Processes, 2010, Encyclopedia of Machine Learning.

[58] Riccardo Poli, et al. A Field Guide to Genetic Programming, 2008.

[59] Gerald Tesauro, et al. Temporal difference learning and TD-Gammon, 1995, CACM.

[60] H. Martín, et al. Ex〈α〉: An effective algorithm for continuous actions Reinforcement Learning problems, 2009.

[61] Christian Igel, et al. Reinforcement learning in a nutshell, 2007, ESANN.

[62] Stephan K. Chalup, et al. A small spiking neural network with LQR control applied to the acrobot, 2008, Neural Computing and Applications.

[63] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.