Optimal habits can develop spontaneously through sensitivity to local cost

Habits and rituals are expressed universally across animal species. These behaviors are advantageous in allowing sequential behaviors to be performed without cognitive overload, and they appear to rely on neural circuits that are relatively benign but vulnerable to takeover by extreme contexts, neuropsychiatric sequelae, and processes leading to addiction. Reinforcement learning (RL) is thought to underlie the formation of optimal habits. However, this theoretical formulation has principally been tested experimentally in simple stimulus-response tasks with relatively few available responses. We asked whether RL could also account for the emergence of habitual action sequences in realistically complex situations in which no repetitive stimulus-response links were imposed and many response options were available. We exposed naïve macaque monkeys to such conditions by introducing a novel free saccade scan task. Despite the highly uncertain conditions and the absence of instruction, the monkeys developed a succession of stereotypical, self-chosen saccade sequence patterns. Remarkably, these patterns continued to morph for months, long after session-averaged reward and cost (eye movement distance) had reached asymptote. Prima facie, these continued behavioral changes appeared to challenge an RL account. However, trial-by-trial analysis showed that pattern changes on adjacent trials were predicted by lowered cost, and RL simulations driven by cost reduction reproduced the monkeys' behavior. Ultimately, the patterns settled into stereotypical saccade sequences that minimized, on average, the cost of obtaining the reward. These findings suggest that the brain mechanisms underlying the emergence of habits, and perhaps of unwanted repetitive behaviors in clinical disorders, could follow RL algorithms that capture extremely local explore/exploit tradeoffs.
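
The simulation details are not given in this summary; purely as an illustrative sketch, the Python snippet below shows one way a cost-sensitive RL rule can produce the qualitative effect: an agent choosing among candidate scan patterns by softmax preference, reinforced by reward minus a distance-proportional movement cost, gradually concentrates on the shortest-path pattern. The target layout, parameter values, and softmax-bandit update rule are all assumptions made for illustration, not the authors' actual model.

```python
import itertools
import math
import random

# Hypothetical layout: four targets at the corners of a unit square.
TARGETS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def path_length(order):
    """Total saccade distance for visiting the targets in the given order."""
    return sum(math.dist(TARGETS[a], TARGETS[b]) for a, b in zip(order, order[1:]))

# One "arm" per candidate scan pattern (permutation of the four targets).
patterns = list(itertools.permutations(range(4)))
preferences = [0.0] * len(patterns)   # softmax preferences, all start equal

REWARD = 1.0   # nominal reward for completing a scan (assumed)
COST = 0.3     # cost per unit of saccade distance (assumed)
ALPHA = 0.1    # learning rate (assumed)
TAU = 0.1      # softmax temperature (assumed)

def softmax_choice(prefs, tau):
    """Sample a pattern index with probability proportional to exp(pref / tau)."""
    m = max(prefs)
    weights = [math.exp((p - m) / tau) for p in prefs]
    return random.choices(range(len(prefs)), weights=weights)[0]

baseline = 0.0  # running-average payoff, used as a comparison level
for trial in range(5000):
    i = softmax_choice(preferences, TAU)
    payoff = REWARD - COST * path_length(patterns[i])  # reward net of movement cost
    preferences[i] += ALPHA * (payoff - baseline)      # strengthen better-than-average patterns
    baseline += 0.01 * (payoff - baseline)

best = max(range(len(patterns)), key=lambda i: preferences[i])
print("preferred pattern:", patterns[best],
      "distance:", round(path_length(patterns[best]), 3))
```

Under these assumptions the preferred pattern converges to a perimeter-following scan (total distance 3 on the unit square), analogous to the cost-minimizing sequences into which the monkeys' behavior eventually settled.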
