Mixing Habits and Planning for Multi-Step Target Reaching Using Arbitrated Predictive Actor-Critic

Internal models are important when agents make decisions based on predictions of future states and their utilities. However, using internal models for planning can be time-consuming, so it can be advantageous to rely on a habitual system for repetitive tasks, which can be executed faster and with fewer computational resources. Current evidence suggests that the brain uses both control systems, planning and habitual, for behavioural control, which in turn requires an arbitration between the two. In our previous work [1], we proposed the Arbitrated Predictive Actor-Critic (APAC), a neural architecture demonstrating cooperative mechanisms of planning and habitual control for one-step mappings. The present study tests the ability of this model to control a simulated two-joint robotic arm in reaching tasks with movement limitations that require multiple steps to reach the target. Our results show that APAC can learn such multi-step reaching under various conditions. Interestingly, over the course of training APAC tends to shift from planning to habits, increasingly taking the actions predicted by the habitual controller.
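To make the arbitration idea concrete, the following is a minimal Python sketch of gating between a habitual controller and a planner on a two-joint arm reaching task. Everything in it is an illustrative assumption: the arm kinematics, the Jacobian-transpose "planner", the linear habitual policy, and the error-based switching rule are placeholders, not the APAC networks or the arbitration mechanism described in [1].

```python
# Hypothetical sketch of planner/habit arbitration for a two-joint arm.
# Not the APAC implementation; all components are illustrative stand-ins.
import numpy as np

L1, L2 = 1.0, 1.0  # link lengths of the simulated two-joint arm


def forward_kinematics(q):
    """End-effector (x, y) position for joint angles q = (shoulder, elbow)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])


def planned_action(q, target, gain=0.2):
    """Model-based step: use the internal model's Jacobian to move the
    end effector toward the target (Jacobian-transpose placeholder)."""
    err = target - forward_kinematics(q)
    J = np.array([
        [-L1 * np.sin(q[0]) - L2 * np.sin(q[0] + q[1]), -L2 * np.sin(q[0] + q[1])],
        [ L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),  L2 * np.cos(q[0] + q[1])],
    ])
    return gain * J.T @ err


def habitual_action(q, target, W):
    """Model-free ('habitual') policy: a cached linear mapping from state to
    joint displacement, standing in for a trained actor network."""
    state = np.concatenate([q, target, [1.0]])  # bias term
    return W @ state


def arbitrated_step(q, target, W, habit_error, threshold=0.05):
    """Use the habitual action once its recent imitation error is small
    enough; otherwise fall back on the slower planning route."""
    if habit_error < threshold:
        return habitual_action(q, target, W), "habit"
    return planned_action(q, target), "plan"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(2, 5))   # untrained habitual policy
    q = np.array([0.3, 0.5])                  # initial joint angles
    target = np.array([1.2, 0.8])             # reaching target in task space
    habit_error = 1.0                         # starts unreliable -> planning

    for step in range(50):
        state = np.concatenate([q, target, [1.0]])
        dq, which = arbitrated_step(q, target, W, habit_error)
        # Supervise the habitual policy with the executed action (in the
        # spirit of supervised actor-critic schemes) and track how well it
        # imitates that action.
        pred = W @ state
        habit_error = 0.9 * habit_error + 0.1 * np.linalg.norm(pred - dq)
        W += 0.05 * np.outer(dq - pred, state)
        q = q + dq
        dist = np.linalg.norm(target - forward_kinematics(q))
        print(f"step {step:2d}  controller={which:5s}  distance={dist:.3f}")
```

Running the sketch, control initially relies on the planner while the habitual mapping is trained on the executed actions; once its imitation error falls below the threshold the habit takes over, loosely mirroring the planning-to-habit shift reported above, though the actual gating signal used by APAC is defined in [1].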

[1] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.

[2] Mitsuo Kawato, et al. Internal models for motor control and trajectory planning, 1999, Current Opinion in Neurobiology.

[3] M. Kawato, et al. A hierarchical neural-network model for control and learning of voluntary movement, 2004, Biological Cybernetics.

[4] Michael T. Rosenstein, et al. Supervised Actor-Critic Reinforcement Learning, 2012.

[5] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[6] Sang Wan Lee, et al. The structure of reinforcement-learning mechanisms in the human brain, 2015, Current Opinion in Behavioral Sciences.

[7] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[8] N. Daw, et al. Multiple Systems for Value Learning, 2014.

[9] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[10] Thomas P. Trappenberg, et al. A Novel Model for Arbitration Between Planning and Habitual Control Systems, 2017, Front. Neurorobot.

[11] M. Rosenstein, et al. Supervised Learning Combined with an Actor-Critic Architecture, 2002.

[12] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, ArXiv.

[13] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[14] Scott T. Grafton, et al. Forward modeling allows feedback control for fast reaching movements, 2000, Trends in Cognitive Sciences.

[15] P. Dayan, et al. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control, 2005, Nature Neuroscience.

[16] M. Jeannerod, et al. Constraints on human arm movement trajectories, 1987, Canadian Journal of Psychology.

[17] P. Dayan, et al. Model-based influences on humans' choices and striatal prediction errors, 2011, Neuron.

[18] Shinsuke Shimojo, et al. Neural Computations Underlying Arbitration between Model-Based and Model-free Learning, 2013, Neuron.

[19] Michael I. Jordan, et al. An internal model for sensorimotor integration, 1995, Science.

[20] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[21] D. Wolpert, et al. Internal models in the cerebellum, 1998, Trends in Cognitive Sciences.

[22] A. Barto, et al. Supervised Actor-Critic Reinforcement Learning, 2007.

[23] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[24] Peter Dayan, et al. A Neural Substrate of Prediction and Reward, 1997, Science.

[25] Michael I. Jordan. Computational aspects of motor control and motor learning, 2008.

[26] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.

[27] G. Uhlenbeck, et al. On the Theory of the Brownian Motion, 1930.