When, What, and How Much to Reward in Reinforcement Learning-Based Models of Cognition

Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward: when (the moment: the end of a trial, the end of a subtask, or some other interval of task performance), what (the objective function: e.g., performance time or performance accuracy), and how much (the magnitude: binary, categorical, or continuous values). In this article, we explore the problem space of these three parameters in the context of a task whose completion entails some combination of 36 state-action pairs, where every intermediate state (i.e., any state after the initial state and before the end state) represents progressive but partial completion of the task. Different choices produce profoundly different learning paths and outcomes, with the strongest effect coming from the choice of moment. Unfortunately, the literature contains little discussion of the effects of such choices. This absence is disappointing, as the choice of when, what, and how much to reward must be made by the modeler of every learning model.
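To make the three parameters concrete, here is a minimal sketch of how they might be varied independently in a tabular Q-learning agent on a toy task sized to match the one described above (6 states x 6 actions = 36 state-action pairs). The names (`moment`, `objective`, `magnitude`) and the toy one-correct-action-per-state rule are illustrative assumptions, not the actual model used in the study.

```python
import random

N_STATES, N_ACTIONS = 6, 6              # 6 x 6 = 36 state-action pairs
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration

def reward(moment, objective, magnitude, at_end, steps, errors):
    """Reward for one transition under a given (when, what, how much) choice."""
    if moment == "end" and not at_end:
        return 0.0                       # WHEN: withhold reward until trial end
    # WHAT: score performance time (fewer steps) or accuracy (fewer errors)
    raw = -steps if objective == "time" else -errors
    if magnitude == "binary":            # HOW MUCH: all-or-none success signal...
        ok = errors == 0 if objective == "accuracy" else steps <= N_STATES - 1
        return 1.0 if ok else 0.0
    return float(raw)                    # ...or a graded, continuous value

def run_trial(q, moment="end", objective="accuracy", magnitude="binary"):
    """One trial; each intermediate state is partial progress toward the goal."""
    state, steps, errors = 0, 0, 0
    while state < N_STATES - 1 and steps < 1_000:   # cap guards against stalls
        if random.random() < EPSILON:               # epsilon-greedy selection
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        correct = action == state        # toy rule: one correct action per state
        next_state = state + 1 if correct else state
        steps += 1
        errors += 0 if correct else 1
        at_end = next_state == N_STATES - 1
        r = reward(moment, objective, magnitude, at_end, steps, errors)
        # standard tabular Q-learning update
        q[state][action] += ALPHA * (
            r + GAMMA * max(q[next_state]) - q[state][action])
        state = next_state
    return steps

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for trial in range(500):                 # compare schedules by swapping arguments
    run_trial(q, moment="step", objective="time", magnitude="continuous")
```

Swapping the keyword arguments (e.g., `moment="end"` versus `moment="step"`) changes which transitions carry nonzero reward and hence which Q-values are credited first, which is one way to see why the choice of moment can dominate the learning trajectory.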
