Evolution of Reinforcement Learning in Uncertain Environments

Acknowledgements

I am grateful to Daphna Joel and Eytan Ruppin, my supervisors, for their invaluable guidance and support. I thank Daphy for boldly venturing into the field of neural network modelling and contributing her clear thoughts, sharp distinctions and original ideas to this work, and Eytan for his wisdom, his endless enthusiasm and his sincere appreciation of the significance of my work. Many thanks to Prof. Isaac Meilijson for the mathematical proof regarding the emergence of risk aversion. I thank Dr. Tamar Keasar for introducing me to the BeeHave lab at HUJI, for providing me with the probability matching data, and for her invaluable comments and new ideas throughout my research. I have benefitted from discussions with the many people who have read drafts of this work or listened to my talks. All of these have helped me immensely in making my ideas coherent and understandable. Special thanks to my family, and especially my father, Yehuda, who followed my research closely, nagged until I finally sat down to write this, and helped me tackle the difficult parts.

Abstract

Reinforcement learning is a fundamental process by which organisms learn to achieve a goal from interactions with the environment. Using Artificial Life techniques, we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels, from which emerge the well-documented choice strategies of risk aversion and probability matching. These strategies are shown to be a direct result of reinforcement learning, providing a biologically founded, parsimonious and novel explanation for these behaviors. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot.

[1]  R. Nicoll,et al.  Glutamate and gamma-aminobutyric acid mediate a heterosynaptic depression at mossy fiber synapses in the hippocampus. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[2]  David H. Ackley,et al.  Interactions between learning and evolution , 1991 .

[3]  Jean-Marc Fellous,et al.  Computational Models of Neuromodulation , 1998, Neural Computation.

[4]  R. Menzel,et al.  Learning and memory in honeybees: from behavior to neural substrates. , 1996, Annual review of neuroscience.

[5]  P. Dayan,et al.  A framework for mesencephalic dopamine systems based on predictive Hebbian learning , 1996, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[6]  S P Wise,et al.  Distributed modular architectures linking basal ganglia, cerebellum, and cerebral cortex: their role in planning and controlling action. , 1995, Cerebral cortex.

[7]  M. Rothschild,et al.  Increasing risk: I. A definition , 1970 .

[8]  A. Kacelnik,et al.  Risky Theories—The Effects of Variance on Foraging Decisions , 1996 .

[9]  Jonathan Baxter The evolution of learning algorithms for artificial neural networks , 1993 .

[10]  Dario Floreano,et al.  Evolution of Plastic Control Networks , 2001, Auton. Robots.

[11]  E. Kandel,et al.  Is Heterosynaptic modulation essential for stabilizing hebbian plasiticity and memory , 2000, Nature Reviews Neuroscience.

[12]  R. Herrnstein,et al.  The Matching Law Papers in Psychology and Economics , 1997 .

[13]  Geoffrey E. Hinton,et al.  How Learning Can Guide Evolution , 1996, Complex Syst..

[14]  L A Real,et al.  Animal choice behavior and the evolution of cognitive architecture , 1991, Science.

[15]  Anil K. Seth Evolving Behavioural Choice: An Investigation into Herrnstein's Matching Law , 1999, ECAL.

[16]  Richard J. Herrnstein,et al.  MAXIMIZING AND MATCHING ON CONCURRENT RATIO SCHEDULES1 , 1975 .

[17]  W. Regehr,et al.  Mechanism and Kinetics of Heterosynaptic Depression at a Cerebellar Synapse , 1997, The Journal of Neuroscience.

[18]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[19]  W. Schultz,et al.  Dopamine responses comply with basic assumptions of formal learning theory , 2001, Nature.

[20]  A. Dickinson,et al.  Reward-related signals carried by dopamine neurons. , 1995 .

[21]  L. Real,et al.  Why are Bumble Bees Risk Averse , 1987 .

[22]  L. Real Paradox, Performance, and the Architecture of Decision-Making in Animals' , 1996 .

[23]  W. Schultz Predictive reward signal of dopamine neurons. , 1998, Journal of neurophysiology.

[24]  Zhongna Sun,et al.  Pathway-Specific Synaptic Plasticity: Activity-Dependent Enhancement and Suppression of Long-Term Heterosynaptic Facilitation at Converging Inputs on a Single Target , 1997, The Journal of Neuroscience.

[25]  Francesco Mondada,et al.  Evolutionary neurocontrollers for autonomous mobile robots , 1998, Neural Networks.

[26]  Peter Dayan,et al.  Bee foraging in uncertain environments using predictive hebbian learning , 1995, Nature.

[27]  P. D. Smallwood An Introduction to Risk Sensitivity: The Use of Jensen's Inequality to Clarify Evolutionary Arguments of Adaptation and Constraint , 1996 .

[28]  Jeffrey L. Elman,et al.  Learning and Evolution in Neural Networks , 1994, Adapt. Behav..

[29]  W. Schultz Multiple reward signals in the brain , 2000, Nature Reviews Neuroscience.

[30]  Ron Meir,et al.  Evolving a learning algorithm for the binary perceptron , 1991 .

[31]  M. Hammer An identified neuron mediates the unconditioned stimulus in associative olfactory learning in honeybees , 1993, Nature.

[32]  John Maynard Smith,et al.  When learning guides evolution , 1987, Nature.

[33]  P. Montague Biological Substrates of Predictive Mechanisms in Learning and Action Choice , 1997 .

[34]  M. Hammer The neural basis of associative reward learning in honeybees , 1997, Trends in Neurosciences.

[35]  M. Domjan The principles of learning and behavior , 1982 .

[36]  David J. Chalmers,et al.  The Evolution of Learning: An Experiment in Genetic Connectionism , 1991 .

[37]  J. March Learning to be risk averse. , 1996 .

[38]  M. Kimura,et al.  Nigrostriatal dopamine system may contribute to behavioral learning through providing reinforcement signals to the striatum. , 1997, European neurology.

[39]  M. Bitterman PHYLETIC DIFFERENCES IN LEARNING. , 1965, The American psychologist.

[40]  Joel L. Davis,et al.  A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement , 1994 .

[41]  B. Peleg,et al.  Automata, matching and foraging behavior of bees , 1995 .

[42]  Tamar Keasar,et al.  Bees in two-armed bandit situations: foraging choices and possible decision mechanisms , 2002 .

[43]  T. Aosaki,et al.  Dopamine-Dependent Synaptic Plasticity in the Striatal Cholinergic Interneurons , 2001, The Journal of Neuroscience.

[44]  J. Donahoe,et al.  Neural-network models of cognition : biobehavioral foundations , 1997 .

[45]  Francesco Mondada,et al.  Evolution of homing navigation in a real mobile robot , 1996, IEEE Trans. Syst. Man Cybern. Part B.