Value and reward based learning in neurorobots

Organisms are equipped with value systems that signal the salience of environmental cues to their nervous system, causing changes in the nervous system that modify behavior. These systems are necessary for an organism to adapt its behavior when an important environmental event occurs. A value system constitutes a basic assumption of what is good and bad for an agent. Value systems have been used effectively in robotic systems to shape behavior. For example, many robots have used models of the dopaminergic system to reinforce behavior that leads to reward. Other neuromodulatory systems that shape behavior include acetylcholine's effect on attention, norepinephrine's effect on vigilance, and serotonin's effect on impulsiveness, mood, and risk. Moreover, hormonal signals, such as oxytocin with its effect on trust, can also act as value systems.

A recent Research Topic in Frontiers in Neurorobotics explored value and reward based learning. The topic comprised nine papers on neurobiologically inspired robots whose behavior was shaped by value and reward learning, adapted through interaction with the environment, or shaped by extracting value from the environment.

Value systems are often linked to reward systems, both in neurobiology and in modeling. For example, Jayet Bray and her colleagues developed a neurorobotic system that learned to categorize the valence of speech through positive verbal encouragement, much as a baby would (Jayet Bray et al., 2013). Their virtual robot, which interacted with a human partner, was controlled by a large-scale spiking neuron model of the visual cortex, premotor cortex, and reward system.

An important issue in both biological and artificial reward systems is the credit assignment problem: how can a distal cue be linked to a reward? In other words, how can the stimulus that predicts a future reward be extracted from all the noisy stimuli an agent faces? Soltoggio and colleagues introduced the principle of rare correlations to resolve this issue (Soltoggio et al., 2013). Using Rarely Correlating Hebbian Plasticity, they demonstrated classical and operant conditioning in a set of human-robot experiments with the iCub robot.

The notion of value and reward has often been formalized in reinforcement learning systems. For example, Li and colleagues showed that reinforcement learning, in the form of a dynamic actor-critic model, can be used to tune central pattern generators in a humanoid robot (Li et al., 2013). Through interaction with the environment, this dynamical system developed biped locomotion on a NAO robot that could adapt its gait to different conditions. Elfwing and colleagues introduced a scaled version of free-energy reinforcement learning (FERL) and applied it to visual recognition and navigation tasks (Elfwing et al., 2013). The scaled algorithm performed significantly better than standard FERL and feedforward neural network RL. A related method, the linearly solvable Markov decision process (LMDP), has been shown to have advantages over conventional RL in computing optimal control policies (Kinjo et al., 2013). Kinjo and colleagues demonstrated the power of the LMDP for robot control by applying the method to a pole-balancing task and a visually guided navigation problem using their six degree-of-freedom Spring Dog robot.
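
To make the linearly solvable formulation concrete, the following minimal sketch illustrates the core idea from Todorov's LMDP work: substituting a desirability variable z(s) = exp(-v(s)) for the value function turns the Bellman equation into a linear equation that can be solved directly. The toy chain task, the state costs, and the passive dynamics below are assumptions made purely for illustration; this is not Kinjo et al.'s implementation, nor a model of their Spring Dog robot.

```python
import numpy as np

# Minimal sketch of a first-exit linearly solvable MDP (LMDP), in the sense of
# Todorov (2006, 2009). The toy chain task, state costs, and passive dynamics
# are illustrative assumptions, not the setup used by Kinjo et al. (2013).

n = 6                        # states 0..5; state 5 is an absorbing goal
q = np.full(n, 0.2)          # state cost incurred at interior states
q[-1] = 0.0                  # no cost at the goal

# Passive (uncontrolled) dynamics: a lazy random walk along the chain.
P = np.zeros((n, n))
for s in range(n - 1):
    P[s, s] += 0.4
    P[s, max(s - 1, 0)] += 0.3
    P[s, s + 1] += 0.3
P[-1, -1] = 1.0

# Desirability z(s) = exp(-v(s)). For interior states the Bellman equation
# becomes linear:  z = exp(-q) * (P @ z),  with z clamped at the goal.
z = np.ones(n)
z[-1] = np.exp(-q[-1])
for _ in range(500):                         # simple fixed-point iteration
    z[:-1] = np.exp(-q[:-1]) * (P[:-1] @ z)

v = -np.log(z)                               # optimal cost-to-go
# Optimal controlled transition probabilities: u*(s'|s) is proportional to P(s'|s) z(s')
u = P * z[None, :]
u /= u.sum(axis=1, keepdims=True)

print("cost-to-go:", np.round(v, 3))
print("controlled transitions from state 0:", np.round(u[0], 3))
```

Solving the linear system (here by fixed-point iteration) yields both the optimal cost-to-go and the optimally controlled transition probabilities, and it is this efficiency that makes the LMDP attractive for robot control.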

Value need not be reward based; curiosity, harm, novelty, and uncertainty can all carry a value signal. For example, in a biomimetic model of the cortex, basal ganglia, and phasic dopamine, Bolado-Gomez and Gurney (2013) showed that intrinsically motivated operant learning (i.e., action discovery) could replicate rodent experiments in a virtual robot. In this case, phasic dopaminergic neuromodulation carried a novelty salience signal rather than the more conventional reward signal. In a model called CURIOUSity-DRiven, Modular, Incremental Slow Feature Analysis (Curious Dr. MISFA), Luciw and colleagues showed that curiosity could shape the behavior of an iCub robot in a multi-context environment (Luciw et al., 2013). Their model was inspired by cortical regions involved in unsupervised learning, as well as by neuromodulatory systems responsible for providing intrinsic rewards through dopamine and for regulating levels of attention through norepinephrine.

Different neuromodulatory systems in the brain may be related to different aspects of value (Krichmar, 2013). In a model of multiple neuromodulatory systems, Krichmar showed that interactions between the dopaminergic (reward), serotonergic (harm aversion), and cholinergic/noradrenergic (novelty) systems could lead to interesting behavioral control in an autonomous robot. Finally, in an interesting position paper, Friston, Adams, and Montague suggested that value is evidence, specifically log Bayesian evidence (Friston et al., 2012). They proposed that the reward or cost functions that underlie value in conventional models of optimal control can be cast as prior beliefs about future states, so that acquiring value is simply the accumulation of evidence through Bayesian updating of posterior beliefs.
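
To give a rough sense of how interacting neuromodulatory systems of this kind can arbitrate behavior, the sketch below lets simulated dopamine, serotonin, and cholinergic/noradrenergic levels compete to select approach, withdrawal, or exploratory actions. The event encoding, gains, decay rate, and softmax action selection are illustrative assumptions, not a reimplementation of Krichmar's (2013) neurorobotic platform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of neuromodulated action selection in the spirit of Krichmar (2013):
# dopamine (reward), serotonin (harm aversion), and ACh/NE (novelty/attention)
# levels bias the choice among approach, withdrawal, and exploratory actions.
# All constants here are assumptions for illustration only.

levels = {"da": 0.5, "se": 0.5, "achne": 0.5}
DECAY = 0.9

def step(reward, threat, novelty):
    """Update neuromodulator levels for one sensory event and select an action."""
    # Phasic responses: each system responds to its preferred event type,
    # with ACh/NE amplifying responses to novel (unexpected) events.
    levels["da"] = DECAY * levels["da"] + reward * (1.0 + levels["achne"])
    levels["se"] = DECAY * levels["se"] + threat * (1.0 + levels["achne"])
    levels["achne"] = DECAY * levels["achne"] + novelty
    # Dopamine and serotonin oppose one another, biasing approach vs. withdrawal,
    # while ACh/NE favors orienting toward the novel event.
    drives = np.array([
        levels["da"] - 0.5 * levels["se"],   # approach / reward seeking
        levels["se"] - 0.5 * levels["da"],   # withdrawal / harm aversion
        levels["achne"],                     # exploration / orient to novelty
    ])
    probs = np.exp(drives) / np.exp(drives).sum()   # softmax action selection
    return rng.choice(["approach", "withdraw", "explore"], p=probs)

print(step(reward=1.0, threat=0.0, novelty=0.2))  # rewarding event: approach likely
print(step(reward=0.0, threat=1.0, novelty=0.8))  # threatening, novel event
```

Even in this toy form, a strong threat combined with high novelty shifts the balance from approach toward withdrawal and exploration, echoing the anxious versus curious behavior examined in that paper.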

As can be gleaned from the papers in this Research Topic, as well as from the empirical evidence and studies they build on, value and reward based learning is an active and broad area of research. Its application to neurorobotics is important for several reasons: (1) it provides an embodied platform for testing hypotheses regarding the neural correlates of value and reward, (2) it provides a means to test more theoretical hypotheses on the acquisition of value and its function in biological and artificial systems, and (3) it may lead to the development of improved learning systems in robots and other autonomous agents.

References

[1] Anders Green, et al. Social and collaborative aspects of interaction with a service robot, 2003, Robotics Auton. Syst.

[2] Kenji Doya, et al. Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state spaces, 2013, Front. Neurorobot.

[3] S. Scott, et al. Positive Emotions Preferentially Engage an Auditory–Motor “Mirror” System, 2006, The Journal of Neuroscience.

[4] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[5] Y. Niv. Reinforcement learning in the brain, 2009.

[6] Emanuel Todorov, et al. Eigenfunction approximation methods for linearly-solvable optimal control problems, 2009, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[7] Kenji Doya, et al. How can we learn efficiently to act optimally and flexibly?, 2009, Proceedings of the National Academy of Sciences.

[8] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[9] Paul Smolensky. Information processing in dynamical systems: foundations of harmony theory, 1986.

[10] Li I. Zhang, et al. A critical window for cooperation and competition among developing retinotectal synapses, 1998, Nature.

[11] Geoffrey E. Hinton, et al. Reinforcement Learning with Factored States and Actions, 2004, J. Mach. Learn. Res.

[12] Rajesh P. N. Rao, et al. Bayesian brain: probabilistic approaches to neural coding, 2006.

[13] K. Scherer, et al. Vocal cues in emotion encoding and decoding, 1991.

[14] Mitsuo Kawato, et al. Multiple Model-Based Reinforcement Learning, 2002, Neural Computation.

[15] K. Caluwaerts, et al. A biologically inspired meta-control navigation system for the Psikharpax rat robot, 2012, Bioinspiration & Biomimetics.

[16] Frederick C. Harris, et al. Implementation of a Biologically Realistic Parallel Neocortical-Neural Network Simulator, 2001, PPSC.

[17] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1996.

[18] Angelo Arleo, et al. Spatial cognition and neuro-mimetic navigation: a model of hippocampal place cell activity, 2000, Biological Cybernetics.

[19] Emanuel Todorov, et al. Efficient computation of optimal actions, 2009, Proceedings of the National Academy of Sciences.

[20] Philippe Gaussier, et al. Autonomous vision-based navigation: Goal-oriented action planning by transient states prediction, cognitive map building, and sensory-motor learning, 2008, 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[21] Emanuel Todorov, et al. Compositionality of optimal control laws, 2009, NIPS.

[22] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[23] Narayan Srinivasa, et al. A Spiking Neural Model for Stable Reinforcement of Synapses Based on Multiple Distal Rewards, 2013, Neural Computation.

[24] Alejandra Barrera, et al. Biologically-inspired robot spatial cognition based on rat neurophysiological studies, 2008, Auton. Robots.

[25] Gordon Wyeth, et al. Persistent Navigation and Mapping using a Biologically Inspired SLAM System, 2010, Int. J. Robotics Res.

[26] Rosalind W. Picard. Affective computing: challenges, 2003, Int. J. Hum. Comput. Stud.

[27] Peter Stone, et al. Generalized model learning for Reinforcement Learning on a humanoid robot, 2010, 2010 IEEE International Conference on Robotics and Automation.

[28] Kenji Doya, et al. Free-Energy Based Reinforcement Learning for Vision-Based Navigation with High-Dimensional Sensory Inputs, 2010, ICONIP.

[29] Pierre-Yves Oudeyer, et al. The production and recognition of emotions in speech: features and algorithms, 2003, Int. J. Hum. Comput. Stud.

[30] Jürgen Schmidhuber, et al. An intrinsic value system for developing multiple invariant representations with incremental slowness learning, 2013, Front. Neurorobot.

[31] Olivier Sigaud, et al. Path Integral Policy Improvement with Covariance Matrix Adaptation, 2012, ICML.

[32] Jochen J. Steil, et al. Rare Neural Correlations Implement Robotic Conditioning with Delayed Rewards and Disturbances, 2013, Front. Neurorobot.

[33] Jeffrey L. Krichmar, et al. A neurorobotic platform to test the influence of neuromodulatory signaling on anxious and curious behavior, 2013, Front. Neurorobot.

[34] Kevin N. Gurney, et al. A biologically plausible embodied model of action discovery, 2012, Front. Neurorobot.

[35] Jan Peters, et al. Model learning for robot control: a survey, 2011, Cognitive Processing.

[36] Jeffrey L. Krichmar, et al. Spatial navigation and causal analysis in a brain-based device modeling cortical-hippocampal interactions, 2007, Neuroinformatics.

[37] Kenji Doya, et al. The Cyber Rodent Project: Exploration of Adaptive Mechanisms for Self-Preservation and Self-Reproduction, 2005, Adapt. Behav.

[38] Constantine Kotropoulos, et al. Emotional speech recognition: Resources, features, and methods, 2006, Speech Commun.

[39] N. Logothetis, et al. Where Are the Human Speech and Voice Regions, and Do Other Animals Have Anything Like Them?, 2009, The Neuroscientist.

[40] Takafumi Kanamori, et al. Least-Squares Conditional Density Estimation, 2010, IEICE Trans. Inf. Syst.

[41] Kenji Doya, et al. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task, 2013, Front. Neurorobot.

[42] Tom Ziemke, et al. Humanoids Learning to Walk: A Natural CPG-Actor-Critic Architecture, 2013, Front. Neurorobot.

[43] N. Logothetis, et al. A voice region in the monkey brain, 2008, Nature Neuroscience.

[44] Frederick C. Harris, et al. Reward-based learning for virtual neurorobotics through emotional speech processing, 2013, Front. Neurorobot.

[45] H. Kappen. Linear theory for control of nonlinear stochastic systems, 2004, Physical Review Letters.

[46] Olivier Sigaud, et al. On-line regression algorithms for learning mechanical models of robots: A survey, 2011, Robotics Auton. Syst.

[47] Stefan Schaal, et al. A Generalized Path Integral Control Approach to Reinforcement Learning, 2010, J. Mach. Learn. Res.

[48] David Haussler, et al. Unsupervised learning of distributions on binary vectors using two layer networks, 1991, NIPS.

[49] Ioannis Pitas, et al. Automatic emotional speech classification, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[50] H. Kappen. Path integrals and symmetry breaking for optimal control theory, 2005, physics/0505066.

[51] Emanuel Todorov, et al. Linearly-solvable Markov decision problems, 2006, NIPS.