When Does Reward Maximization Lead to Matching Law?

Which strategies subjects follow in various behavioral circumstances has been a central issue in decision making. In particular, whether maximizing or matching is more fundamental to animals' decision behavior has been a matter of debate. Here, we prove that any algorithm that achieves the stationary condition for maximizing the average reward leads to matching when it ignores the dependence of the expected outcome on the subject's past choices. We term this strategy of partial reward maximization the “matching strategy”. We then apply this strategy to the case in which the subject's decision system updates the information used for making a decision. Such information includes the subject's past actions or sensory stimuli, and the internal storage of this information is often called “state variables”. We demonstrate that the matching strategy provides an easy way to maximize reward when it is combined with exploration of state variables that correctly represent the information crucial for reward maximization. Our results reveal for the first time how a strategy that achieves matching behavior is beneficial to reward maximization, providing a novel insight into the relationship between maximizing and matching.
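As a minimal illustrative sketch (not the paper's formal proof), the following Python simulation shows the flavor of the “matching strategy”: a melioration-style update that tracks only the local reward per response on each alternative and ignores how the allocation itself shapes future reward availability. On concurrent variable-interval schedules, equalizing the local rates implies matching of the behavior ratio to the obtained-reward ratio. The baiting probabilities, learning rates, and the exponential-average estimator are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Concurrent variable-interval (VI) schedule: each alternative is "baited"
# with a fixed per-trial probability and holds the reward until collected.
bait_prob = np.array([0.10, 0.30])   # hypothetical baiting rates
baited = np.array([False, False])

p = 0.5                              # probability of choosing alternative 0
lr = 0.002                           # hypothetical melioration step size
local_rate = np.array([0.5, 0.5])    # running estimate of reward per response
choices = np.zeros(2)
rewards = np.zeros(2)

for t in range(200_000):
    baited |= rng.random(2) < bait_prob   # each schedule baits independently
    a = 0 if rng.random() < p else 1      # sample a choice from the allocation
    r = 1.0 if baited[a] else 0.0
    baited[a] = False                     # a collected reward is removed
    choices[a] += 1
    rewards[a] += r

    # Update the local reward rate only for the chosen alternative.
    # Crucially, this estimator ignores how p itself affects future baiting,
    # i.e., the dependence of the expected outcome on past choices.
    local_rate[a] += 0.05 * (r - local_rate[a])

    # Melioration: shift allocation toward the locally richer alternative.
    p = float(np.clip(p + lr * (local_rate[0] - local_rate[1]), 0.01, 0.99))

print("behavior ratio:", choices[0] / choices.sum())
print("reward ratio  :", rewards[0] / rewards.sum())
```

At the fixed point the local reward rates per response are equalized across alternatives, which is exactly the matching condition: the fraction of responses allocated to each alternative approximately equals the fraction of rewards obtained from it.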
