RLOC: Neurobiologically Inspired Hierarchical Reinforcement Learning Algorithm for Continuous Control of Nonlinear Dynamical Systems

Nonlinear optimal control problems are often solved with numerical methods that require knowledge of the system's dynamics, which may be difficult to infer, and that carry a large computational cost associated with iterative calculations. We present a novel neurobiologically inspired hierarchical learning framework, Reinforcement Learning Optimal Control (RLOC), which operates on two levels of abstraction and uses a reduced number of controllers to solve nonlinear systems with unknown dynamics in continuous state and action spaces. Our approach draws on research at two levels of abstraction: first, at the level of limb coordination, human behaviour is well explained by linear optimal feedback control theory; second, in cognitive tasks involving symbolic action selection, human learning is well described by model-free and model-based reinforcement learning algorithms. We propose that combining these two levels of abstraction yields a fast global solution to nonlinear control problems using a reduced number of controllers. Our framework learns the local task dynamics from naive experience and forms locally optimal infinite-horizon Linear Quadratic Regulators (LQRs) that produce continuous low-level control. A top-level reinforcement learner treats these controllers as actions and learns how best to combine them across the state space while maximising a long-term reward. A single optimal control objective function drives the high-level symbolic learning by providing training signals on the desirability of each selected controller. We show that a small number of locally optimal linear controllers, combined with a reinforcement learner in this hierarchical framework, can solve global nonlinear control problems with unknown dynamics. Our algorithm competes in computational cost and solution quality with sophisticated control algorithms, and we illustrate this with solutions to benchmark problems.
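To make the two-level structure concrete, the following is a minimal sketch (not the authors' implementation) of the RLOC idea: low-level infinite-horizon LQR controllers built from locally fitted linear dynamics, and a high-level Q-learner that selects which controller to activate in each symbolic state. The class and parameter names (local_models as a list of fitted (A, B) pairs, n_symbolic_states, the reward signal, etc.) are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_are


def lqr_gain(A, B, Q, R):
    """Infinite-horizon discrete-time LQR gain K for x_{t+1} = A x_t + B u_t."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)


class RLOCSketch:
    def __init__(self, local_models, Q, R, n_symbolic_states, alpha=0.1, gamma=0.99):
        # One LQR controller per locally fitted linear model (A_i, B_i).
        self.gains = [lqr_gain(A, B, Q, R) for A, B in local_models]
        # High-level action-value table: symbolic state x controller index.
        self.q_table = np.zeros((n_symbolic_states, len(self.gains)))
        self.alpha, self.gamma = alpha, gamma

    def low_level_control(self, k, x, x_target):
        """Continuous control from controller k, driving state x towards x_target."""
        return -self.gains[k] @ (x - x_target)

    def select_controller(self, s, eps=0.1):
        """Epsilon-greedy choice of which local controller to run from symbolic state s."""
        if np.random.rand() < eps:
            return np.random.randint(self.q_table.shape[1])
        return int(np.argmax(self.q_table[s]))

    def update(self, s, k, reward, s_next):
        """Q-learning update; the reward would come from the single control objective."""
        td_target = reward + self.gamma * self.q_table[s_next].max()
        self.q_table[s, k] += self.alpha * (td_target - self.q_table[s, k])
```

In this sketch the high-level learner never sees raw continuous controls; it only learns which locally optimal controller is most desirable in each region of state space, which is the division of labour the framework describes.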
