Integrating Temporal Difference Methods and Self-Organizing Neural Networks for Reinforcement Learning With Delayed Evaluative Feedback

This paper presents a neural architecture for learning category nodes that encode mappings across multimodal patterns involving sensory inputs, actions, and rewards. By integrating adaptive resonance theory (ART) and temporal difference (TD) methods, the proposed neural model, called TD fusion architecture for learning, cognition, and navigation (TD-FALCON), enables an autonomous agent to adapt and function in a dynamic environment with both immediate and delayed evaluative feedback (reinforcement) signals. TD-FALCON learns value functions over the state-action space estimated through on-policy and off-policy TD learning methods, specifically state-action-reward-state-action (SARSA) and Q-learning. The learned value functions are then used to determine the optimal actions according to an action selection policy. We have developed TD-FALCON systems using various TD learning strategies and compared their performance in terms of task completion, learning speed, and time and space efficiency. Experiments on a minefield navigation task show that TD-FALCON systems learn effectively with both immediate and delayed reinforcement and reach stable performance much faster than standard gradient-descent-based reinforcement learning systems.
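To make the two TD rules named above concrete, the following Python sketch shows the SARSA (on-policy) and Q-learning (off-policy) updates together with an epsilon-greedy action selection policy. This is a minimal tabular illustration, not the paper's implementation: TD-FALCON stores value estimates in ART category nodes, whereas a plain dictionary is used here, and the parameter values (ALPHA, GAMMA, EPSILON) are illustrative assumptions.

```python
import random
from collections import defaultdict

ALPHA = 0.5    # learning rate (illustrative value)
GAMMA = 0.9    # discount factor (illustrative value)
EPSILON = 0.1  # exploration rate for the epsilon-greedy policy

Q = defaultdict(float)  # maps (state, action) -> estimated value
ACTIONS = ["left", "right", "up", "down"]

def select_action(state):
    """Epsilon-greedy action selection policy: explore with
    probability EPSILON, otherwise exploit the current estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy TD rule: bootstrap from the best next action,
    regardless of which action the policy actually takes next."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD rule: bootstrap from the action the policy
    actually selected in the next state."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

The only difference between the two rules is the bootstrap target: Q-learning uses the maximum value over next actions, while SARSA uses the value of the action the policy actually chose, which is why the former is off-policy and the latter on-policy.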
