A Novel Model for Arbitration Between Planning and Habitual Control Systems

It is well established that human decision making and instrumental control rely on multiple systems, some of which use habitual action selection and some of which require deliberate planning. Deliberate planning systems predict action outcomes using an internal model of the agent's environment, whereas habitual action-selection systems learn to automate behavior by repeating previously rewarded actions. Habitual control is computationally efficient but not very flexible in changing environments; conversely, deliberate planning can be computationally expensive but remains flexible in dynamic environments. This paper proposes a general architecture comprising both control paradigms, with an arbitrator that determines which subsystem is in control at any given time. The architecture is implemented for a target-reaching task with a simulated two-joint robotic arm, combining a supervised internal model with deep reinforcement learning. By permuting the target-reaching conditions, we demonstrate that the proposed system rapidly learns the kinematics of the arm without a priori knowledge and is robust to (A) changes in environmental reward and kinematics and (B) occluded vision. The arbitrated model is compared with instances of the model that use exclusively deliberate planning with the internal model or exclusively habitual control. The results show how such a model can harness the benefits of both systems, making fast decisions under reliable circumstances while optimizing performance in changing environments. In addition, the proposed model learns very quickly. Finally, the systems that include the internal model are able to reach the target under visual occlusion, while the purely habitual system cannot operate adequately under such conditions.
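To make the arbitration scheme concrete, the sketch below outlines one plausible control loop in Python. It is a minimal illustration under stated assumptions, not the paper's actual implementation: the class names `HabitualPolicy` and `InternalModelPlanner`, the function `arbitrated_step`, the reliability estimate, and the `threshold` parameter are all hypothetical stand-ins introduced here for clarity.

```python
import numpy as np

class HabitualPolicy:
    """Stand-in for the model-free deep RL controller (hypothetical API)."""
    def __init__(self):
        self.recent_errors = []           # running reward-prediction errors

    def act(self, obs):
        # Placeholder joint-velocity command from the cached policy.
        return np.tanh(obs[:2])

    def reliability(self):
        # Reliability is high when recent prediction errors are small;
        # a change in reward or kinematics inflates the errors.
        if not self.recent_errors:
            return 0.0
        return float(np.exp(-np.mean(np.abs(self.recent_errors))))

class InternalModelPlanner:
    """Stand-in for the supervised internal (forward) model (hypothetical API)."""
    def plan(self, obs, target):
        # Placeholder: move the joints proportionally toward the target.
        return 0.1 * (target - obs[:2])

def arbitrated_step(obs, target, habit, planner, threshold=0.8):
    """Pick the controller for this time step.

    The uncertainty-based gating rule below is an assumption for
    illustration; the paper's arbitrator may use a different criterion.
    """
    if habit.reliability() >= threshold:
        return habit.act(obs)             # fast, habitual response
    return planner.plan(obs, target)      # flexible, model-based planning
```

In this sketch the habitual controller wins the arbitration as long as its recent prediction errors stay small; sustained surprise, for example after a change in reward structure or arm kinematics, lowers its estimated reliability and hands control back to the internal-model planner, matching the qualitative behavior described above.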
