Self-improving reactive agents based on reinforcement learning, planning and teaching

To date, reinforcement learning has mostly been studied on simple learning tasks, and the methods studied so far typically converge slowly. The purpose of this work is thus two-fold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that speed up reinforcement learning. This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions of each basic method for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks use connectionist networks as the approach to generalization. To evaluate the performance of the different frameworks, a moderately complex, nondeterministic dynamic environment was used as a testbed. This paper describes the frameworks and algorithms in detail and presents an empirical evaluation of them.
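As a rough illustration of two of the ideas named above, Watkins' one-step Q-learning update and experience replay, here is a minimal sketch in Python. The environment interface (`reset`/`step`), hyperparameters, and helper names are illustrative assumptions, not the paper's implementation; the paper's agents (e.g. the QCON family) used connectionist function approximators rather than the lookup table used here.

```python
import random
from collections import defaultdict, deque

# Minimal sketch: tabular Q-learning with experience replay.
# All names, hyperparameters, and the env interface are assumptions
# for illustration; the paper used neural-network value functions.

ALPHA = 0.1        # learning rate
GAMMA = 0.9        # discount factor
EPSILON = 0.1      # exploration probability
REPLAY_SIZE = 1000 # capacity of the experience buffer
REPLAY_BATCH = 32  # experiences replayed per step

def q_update(Q, state, action, reward, next_state, n_actions, terminal):
    """One-step backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if terminal else max(Q[(next_state, a)] for a in range(n_actions))
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

def select_action(Q, state, n_actions):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def train(env, n_actions, episodes=100):
    Q = defaultdict(float)
    replay = deque(maxlen=REPLAY_SIZE)  # stored (s, a, r, s', terminal) experiences
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = select_action(Q, state, n_actions)
            next_state, reward, done = env.step(action)
            # Learn from the fresh experience ...
            q_update(Q, state, action, reward, next_state, n_actions, done)
            # ... and remember it for later replay.
            replay.append((state, action, reward, next_state, done))
            # Experience replay: re-present a sample of past experiences
            # so each interaction with the environment is used many times.
            for s, a, r, s2, d in random.sample(replay, min(REPLAY_BATCH, len(replay))):
                q_update(Q, s, a, r, s2, n_actions, d)
            state = next_state
    return Q
```

The same loop could be extended in the spirit of the other two extensions: replaying hypothetical experiences generated by a learned action model (planning), or seeding the buffer with experiences demonstrated by a teacher.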
