Efficient Exploration in Reinforcement Learning

Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). The paper distinguishes between two families of exploration schemes: undirected and directed exploration. While the former family is closely related to random-walk exploration, directed exploration techniques memorize exploration-specific knowledge that is used to guide the exploration search. In many finite deterministic domains, any learning technique based on undirected exploration is inefficient in terms of learning time, i.e. learning time is expected to scale exponentially with the size of the state space (Whitehead, 1991b). We prove that for all these domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of exploration in reinforcement learning. (The proof is given for one specific directed exploration technique, named counter-based exploration.) Subsequently, several exploration techniques found in recent reinforcement learning and connectionist adaptive control literature are described. In order to trade off efficiently between exploration and exploitation, a trade-off which characterizes many real-world active learning tasks, combination methods are described which explore and avoid costs simultaneously. These include a selective attention mechanism that allows smooth switching between exploration and exploitation. All techniques are evaluated and compared on a discrete reinforcement learning task (robot navigation), and the empirical evaluation is followed by an extensive discussion of the benefits and limitations of this work. A minimal sketch of counter-based directed exploration is given after this abstract.
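To make the distinction between undirected and directed exploration concrete, the sketch below illustrates counter-based directed exploration in a finite, deterministic domain. It is a minimal illustration, not the paper's exact formulation: the class name CounterBasedExplorer, the exploit_weight blending constant, and the assumed successor model succ(s, a) are assumptions introduced here for exposition.

```python
import random
from collections import defaultdict

class CounterBasedExplorer:
    """Minimal sketch of counter-based directed exploration in a finite,
    deterministic domain. Assumes a successor model succ(s, a) is available
    and that Q-values are maintained elsewhere (e.g. by Q-learning).
    Names and the blending rule are illustrative assumptions."""

    def __init__(self, actions, exploit_weight=0.0):
        self.actions = actions
        self.counts = defaultdict(int)         # c(s): visits per state
        self.exploit_weight = exploit_weight   # 0 = pure exploration

    def select_action(self, state, succ, q_values):
        """Pick the action whose predicted successor has been visited least
        often, optionally blended with the exploitation (Q) value."""
        self.counts[state] += 1

        def score(a):
            exploration = -self.counts[succ(state, a)]   # prefer rarely visited states
            exploitation = q_values.get((state, a), 0.0)
            return exploration + self.exploit_weight * exploitation

        best = max(score(a) for a in self.actions)
        # Break ties randomly so the agent does not cycle deterministically.
        return random.choice([a for a in self.actions if score(a) == best])
```

In a deterministic navigation domain, this rule steers the agent toward successor states with the smallest visitation counters, which is the intuition behind the polynomial-time learning result for directed exploration; setting exploit_weight above zero corresponds in spirit to the combination methods that trade off exploration against cost.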

[1] Ronald A. Howard, et al. Dynamic Programming and Markov Processes, 1960.

[2] Richard S. Sutton, et al. Temporal credit assignment in reinforcement learning, 1984.

[3] Charles W. Anderson, et al. Learning and problem-solving with multilayer connectionist systems (adaptive, strategy learning, neural networks, reinforcement learning), 1986.

[4] Ronald L. Rivest, et al. Diversity-based inference of finite automata, 1987, 28th Annual Symposium on Foundations of Computer Science (FOCS 1987).

[5] B. Widrow, et al. The truck backer-upper: an example of self-learning in neural networks, 1989, International 1989 Joint Conference on Neural Networks.

[6] A. Barto, et al. Learning and Sequential Decision Making, 1989.

[7] Michael I. Jordan, et al. Learning to Control an Unstable System with Forward Modeling, 1989, NIPS.

[8] Michael C. Mozer, et al. Discovering the Structure of a Reactive Environment by Exploration, 1990, Neural Computation.

[9] Sebastian Thrun, et al. Planning with an Adaptive World Model, 1990, NIPS.

[10] Bartlett W. Mel, et al. Murphy: A neurally-inspired connectionist approach to learning and performance in vision-based robot motion planning, 1990.

[11] Donald A. Sofge, et al. Neural network based process optimization and control, 1990, 29th IEEE Conference on Decision and Control.

[12] Andrew W. Moore, et al. Efficient memory-based learning for robot control, 1990.

[13] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[14] Andrew G. Barto, et al. On the Computational Economics of Reinforcement Learning, 1991.

[15] Jürgen Schmidhuber, et al. Adaptive confidence and adaptive curiosity, 1991, Forschungsberichte, TU Munich.

[16] Leslie Pack Kaelbling, et al. Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons, 1991, IJCAI.

[17] Sridhar Mahadevan, et al. Scaling Reinforcement Learning to Robotics by Exploiting the Subsumption Architecture, 1991, ML.

[18] Steven D. Whitehead, et al. Complexity and Cooperation in Q-Learning, 1991, ML.

[19] Sebastian Thrun, et al. On Planning and Exploration in Non-Discrete Environments, 1991.

[20] A. W. Moore, et al. An Introductory Tutorial on Kd-trees. Extract from Andrew Moore's PhD Thesis: Efficient Memory-based Learning for Robot Control, 1991.

[21] Sebastian Thrun, et al. Active Exploration in Dynamic Environments, 1991, NIPS.

[22] Andrew W. Moore, et al. Knowledge of knowledge and intelligent experimentation for learning control, 1991, IJCNN-91-Seattle International Joint Conference on Neural Networks.

[23] Long-Ji Lin, et al. Self-improving reactive agents: case studies of reinforcement learning frameworks, 1991.

[24] Long-Ji Lin, et al. Self-improvement Based on Reinforcement Learning, Planning and Teaching, 1991, ML.

[25] Sridhar Mahadevan, et al. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning, 1991, Artif. Intell.

[26] Leslie Pack Kaelbling, et al. Learning in embedded systems, 1993.