Paper information - Finding minimal action sequences with a simple evaluation of actions

Animals are able to discover the minimal number of actions that achieves an outcome (the minimal action sequence). In most accounts of this, actions are associated with a measure of behavior that is higher for actions that lead to the outcome with a shorter action sequence, and learning mechanisms find the actions associated with the highest measure. In this sense, previous accounts focus on more than the simple binary signal of “was the outcome achieved?”; they focus on “how well was the outcome achieved?” However, such mechanisms may not govern all types of behavioral development. In particular, in the process of action discovery (Redgrave and Gurney, 2006), actions are reinforced if they simply lead to a salient outcome because biological reinforcement signals occur too quickly to evaluate the consequences of an action beyond an indication of the outcome's occurrence. Thus, action discovery mechanisms focus on the simple evaluation of “was the outcome achieved?” and not “how well was the outcome achieved?” Notwithstanding this impoverishment of information, can the process of action discovery find the minimal action sequence? We address this question by implementing computational mechanisms, referred to in this paper as no-cost learning rules, in which each action that leads to the outcome is associated with the same measure of behavior. No-cost rules focus on “was the outcome achieved?” and are consistent with action discovery. No-cost rules discover the minimal action sequence in simulated tasks and execute it for a substantial amount of time. Extensive training, however, results in extraneous actions, suggesting that a separate process (which has been proposed in action discovery) must attenuate learning if no-cost rules participate in behavioral development. We describe how no-cost rules develop behavior, what happens when attenuation is disrupted, and relate the new mechanisms to the wider computational and biological context.
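
The contrast the abstract draws between the two kinds of evaluation can be illustrated with a minimal sketch. This is not the paper's learning rule. The function names, the exponential form of the cost-sensitive measure, and the `discount` value of 0.9 are all assumptions chosen only to show the difference between "how well was the outcome achieved?" and "was the outcome achieved?".

```python
def cost_sensitive_measure(sequence_length: int, discount: float = 0.9) -> float:
    """Hypothetical 'how well?' measure: higher for shorter action sequences.

    Exponential discounting is an assumption made for illustration; the
    abstract only says the measure is higher for shorter sequences.
    """
    if sequence_length < 1:
        raise ValueError("sequence_length must be at least 1")
    return discount ** (sequence_length - 1)


def no_cost_measure(outcome_achieved: bool) -> float:
    """Hypothetical 'was it achieved?' measure: the same value for every
    action sequence that reaches the outcome, regardless of its length."""
    return 1.0 if outcome_achieved else 0.0


if __name__ == "__main__":
    # Compare the two measures for sequences of 1, 2, and 5 actions
    # that all reach the outcome.
    for n in (1, 2, 5):
        print(n, cost_sensitive_measure(n), no_cost_measure(True))
```

Under the cost-sensitive measure a learner can rank sequences by length directly. Under the no-cost measure every successful sequence looks identical, which is why the question of whether the minimal sequence can still be found is not trivial.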

Ashvin Shah | Kevin N. Gurney
