Incremental acquisition of behaviors and signs based on a reinforcement learning schemata model and a spike timing-dependent plasticity network

A novel integrative learning architecture based on a reinforcement learning schemata model (RLSM) with a spike timing-dependent plasticity (STDP) network is described. This architecture models operant conditioning with discriminative stimuli in an autonomous agent engaged in multiple reinforcement learning tasks. The architecture consists of two constituent learning components: RLSM and an STDP network. RLSM is an incremental modular reinforcement learning architecture that enables an autonomous agent to acquire several behavioral concepts incrementally through continuous interaction with its environment and/or caregivers. STDP is a learning rule of neuronal plasticity found in the cerebral cortex and hippocampus of the human brain; it is temporally asymmetric, in contrast to the classical Hebbian learning rule. We found that STDP enabled an autonomous robot to associate auditory input with its acquired behaviors and to select reinforcement learning modules more effectively. Auditory signals interpreted in terms of the acquired behaviors were found to correspond to 'signs' of required behaviors and upcoming situations. This integrative learning architecture was evaluated in the context of on-line modular learning.
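To make the plasticity rule concrete, the following is a minimal Python sketch of a standard pair-based STDP weight update of the kind the abstract refers to: a synapse is potentiated when a presynaptic spike precedes the postsynaptic spike and depressed when the order is reversed, with exponentially decaying windows. The amplitudes, time constants, spike times, and the reading of the two neurons as an auditory input and a behavior-module unit are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Pair-based STDP parameters (assumed for illustration).
A_PLUS, A_MINUS = 0.01, 0.012      # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # decay time constants (ms)

def stdp_dw(dt_ms: float) -> float:
    """Weight change for one pre/post spike pair.

    dt_ms = t_post - t_pre: positive when the presynaptic spike
    arrives first (causal pairing -> potentiation, LTP); negative
    when it arrives after (anti-causal pairing -> depression, LTD).
    """
    if dt_ms > 0:
        return A_PLUS * np.exp(-dt_ms / TAU_PLUS)    # LTP branch
    return -A_MINUS * np.exp(dt_ms / TAU_MINUS)      # LTD branch

# Hypothetical usage: an "auditory" neuron repeatedly fires a few
# milliseconds before a "behavior-module" neuron, so the
# auditory -> module synapse is strengthened. Over many trials the
# utterance can come to act as a sign that biases module selection.
w = 0.5
for t_pre, t_post in [(100.0, 105.0), (200.0, 204.0), (300.0, 310.0)]:
    w += stdp_dw(t_post - t_pre)
    w = min(max(w, 0.0), 1.0)  # clip weight to [0, 1]
print(f"final weight: {w:.3f}")
```

Because the rule rewards only the causal pre-before-post ordering, it captures the temporal asymmetry the abstract contrasts with plain Hebbian correlation learning: auditory cues that reliably precede a behavior become predictive links to it, rather than merely co-active with it.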
