Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Artificial behavioral agents are often evaluated on the consistency of their behavior and on their performance at taking sequential actions in an environment to maximize some notion of cumulative reward. Human decision making in real life, however, usually involves diverse strategies and behavioral trajectories that lead to the same empirical outcome. Motivated by the clinical literature on a wide range of neurological and psychiatric disorders, we propose a more general and flexible parametric framework for sequential decision making built on a two-stream reward processing mechanism. We demonstrate that this framework is flexible and unified enough to incorporate a family of problems spanning multi-armed bandits (MAB), contextual bandits (CB), and reinforcement learning (RL), which decompose the sequential decision-making process at different levels. Inspired by the known reward processing abnormalities of many mental disorders, our clinically inspired agents exhibit distinctive behavioral trajectories and comparable performance on simulated tasks with particular reward distributions, on a real-world dataset capturing human decision making in gambling tasks, and on the PacMan game across different reward stationarities in a lifelong learning setting.
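The abstract describes the two-stream mechanism without stating an update rule, so the following is a minimal illustrative sketch rather than the paper's exact formulation. It assumes that each reward is split into a gain and a loss component tracked in separate Q-tables, that each stream carries its own memory weight (lam_pos, lam_neg) and reward-sensitivity weight (w_pos, w_neg), and that the two streams are combined additively for action selection; all names and default values here are hypothetical.

```python
import numpy as np

class SplitQAgent:
    """Tabular sketch of a two-stream ("split") Q-learner.

    Gains and losses are processed in separate value streams, each with
    its own memory weight (lam_*) and reward sensitivity (w_*); setting
    all four weights to 1 recovers standard Q-learning.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 epsilon=0.1, lam_pos=1.0, w_pos=1.0,
                 lam_neg=1.0, w_neg=1.0, seed=0):
        self.q_pos = np.zeros((n_states, n_actions))  # stream for positive rewards
        self.q_neg = np.zeros((n_states, n_actions))  # stream for negative rewards
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.lam_pos, self.w_pos = lam_pos, w_pos
        self.lam_neg, self.w_neg = lam_neg, w_neg
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        # Epsilon-greedy over the additively combined streams.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q_pos.shape[1]))
        return int(np.argmax(self.q_pos[state] + self.q_neg[state]))

    def update(self, state, action, reward, next_state):
        # Split the scalar reward into its gain and loss components.
        r_pos, r_neg = max(reward, 0.0), min(reward, 0.0)
        for q, lam, w, r in ((self.q_pos, self.lam_pos, self.w_pos, r_pos),
                             (self.q_neg, self.lam_neg, self.w_neg, r_neg)):
            recalled = lam * q[state, action]                   # biased recall of the stored value
            target = w * r + self.gamma * q[next_state].max()   # biased reward plus bootstrap
            q[state, action] = recalled + self.alpha * (target - recalled)
```

Under these assumptions the same mechanism specializes across the problem family named in the abstract: with a single state and gamma = 0 it reduces to a two-stream multi-armed bandit, and replacing the tables with context-conditioned estimators yields a contextual-bandit variant. A hypothetical reward-hypersensitive setting such as w_pos > 1 with lam_neg < 1 would, for example, overweight gains while quickly forgetting losses, illustrating how the weights can encode clinically motivated reward-processing biases.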
