Optimal response vigor and choice under non-stationary outcome values

Within a rational framework, a decision-maker selects actions according to the reward-maximization principle: acquire the outcomes with the highest value at the lowest cost. Action selection has two dimensions: choosing an action from the available alternatives, and choosing its vigor, i.e., how fast the selected action is executed. Both dimensions depend on the values of outcomes, and these values often change as more outcomes are consumed along with the actions that earn them. Despite this, previous research has addressed the computational substrate of optimal actions only under the condition that outcome values are constant; which actions are optimal when outcome values are non-stationary is not known. Here, using an optimal control framework, we derive a computational model of optimal actions under non-stationary outcome values. The results imply that, even when outcome values are changing, the optimal response rate is constant rather than decreasing. This finding shows that, in contrast to previous theories, the commonly observed changes in response rate cannot be attributed solely to changes in outcome value. We then prove that these changes in response rate can instead be explained by uncertainty about temporal horizons, e.g., the session duration. We further show that, when multiple outcomes are available, the model explains probability matching as well as maximization strategies. The model therefore provides a quantitative analysis of optimal action and explicit predictions for future testing.
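The constant-rate result can be illustrated with a small numerical check. The sketch below is not the model derived in the paper; it assumes a hypothetical fixed-ratio schedule, a quadratic cost of response rate, a concave value of cumulative consumption (so the value of each further outcome falls), and a known session duration. All function names and parameter values are illustrative. Under these assumptions, two response profiles that earn the same number of outcomes collect the same total reward, so the profile with the lower total cost is the better one.

```python
import math

def total_reward(outcomes, satiety=0.1):
    # Assumed concave value of cumulative consumption:
    # each additional outcome is worth less than the previous one.
    return (1.0 - math.exp(-satiety * outcomes)) / satiety

def net_return(rates, dt, ratio=5.0, cost_coeff=0.02):
    # rates: response rate in each time bin of width dt.
    # ratio: responses required per outcome (hypothetical fixed-ratio schedule).
    outcomes = sum(v * dt for v in rates) / ratio
    # Assumed quadratic (convex) cost of responding at rate v.
    cost = sum(cost_coeff * v * v * dt for v in rates)
    return total_reward(outcomes) - cost

n_bins, dt = 100, 1.0

# Constant profile: 2 responses per unit time throughout the session.
constant = [2.0] * n_bins

# Decreasing profile with the same mean rate: falls linearly from 3 to 1.
decreasing = [3.0 - 2.0 * i / (n_bins - 1) for i in range(n_bins)]

constant_return = net_return(constant, dt)
decreasing_return = net_return(decreasing, dt)

print("constant rate:  ", round(constant_return, 3))
print("decreasing rate:", round(decreasing_return, 3))
```

Because the assumed reward depends only on how many outcomes are earned by the end of the session, while the assumed cost is convex in the rate, spreading responses evenly is cheaper than front-loading them. This holds in the sketch even though the value of each outcome declines within the session.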
