Utility-Based On-Line Exploration for Repeated Navigation in an Embedded Graph

In this paper, we address the tradeoff between exploration and exploitation for agents that need to learn more about the structure of their environment in order to perform more effectively. For example, a robot may need to learn the most efficient routes between important sites in its environment. We compare on-line and off-line exploration for a repeated task, where the agent is given a particular task to perform a fixed number of times. Tasks are modeled as navigation on a graph embedded in the plane. We describe a utility-based on-line exploration algorithm for repeated tasks that weighs the cost of each exploratory action against its potential benefit over future task repetitions. Exploration proceeds greedily: the locally optimal exploratory action is taken on each task repetition. We experimentally evaluated our utility-based on-line algorithm against a heuristic search algorithm for off-line exploration and against a randomized on-line exploration algorithm. For a single repeated task, utility-based on-line exploration consistently outperforms the alternatives unless the number of task repetitions is very high. We also extended the algorithms to the case of multiple repeated tasks, where the agent is given a different randomly chosen task on each repetition; here too, utility-based on-line exploration is often preferred.
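The abstract describes the algorithm only at a high level. The Python sketch below is our own illustration of the greedy utility computation it outlines, under simplifying assumptions not taken from the paper: candidate exploratory detours are given explicitly, unexplored edges are priced at their straight-line lower bounds (available because the graph is embedded in the plane), and a useful shortcut is assumed to pay off on every remaining repetition. All names here (choose_exploration, with_edges, prob_useful) are hypothetical, not the paper's.

```python
import heapq


def shortest_path_cost(graph, start, goal):
    # Dijkstra over the currently known edge costs.
    # graph: node -> {neighbor: cost}; returns inf if goal is unreachable.
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")


def with_edges(graph, edges):
    # Copy of the known graph with extra undirected edges (u, v, cost) added.
    g = {u: dict(nbrs) for u, nbrs in graph.items()}
    for u, v, w in edges:
        g.setdefault(u, {})[v] = w
        g.setdefault(v, {})[u] = w
    return g


def choose_exploration(known, candidates, start, goal, reps_left):
    # Greedy utility-based choice of a single exploratory detour.
    # Each candidate is (unexplored_edges, extra_cost, prob_useful):
    #   unexplored_edges -- edges priced at straight-line lower bounds
    #   extra_cost       -- one-time cost of the detour, paid now
    #   prob_useful      -- estimated probability the edges pan out
    # Returns the candidate with the highest positive expected net utility,
    # or None if pure exploitation (the best known path) is preferred.
    base = shortest_path_cost(known, start, goal)
    best, best_utility = None, 0.0
    for edges, extra_cost, prob_useful in candidates:
        optimistic = with_edges(known, edges)
        savings = base - shortest_path_cost(optimistic, start, goal)
        # The benefit accrues on every remaining repetition; the detour's
        # extra cost is paid only once, on this repetition.
        utility = prob_useful * savings * reps_left - extra_cost
        if utility > best_utility:
            best, best_utility = (edges, extra_cost, prob_useful), utility
    return best


# Toy example: a right-angle path A-B-C of total cost 2.0, with an
# unexplored diagonal A-C whose straight-line length is 1.5.
known = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
diagonal = ([("A", "C", 1.5)], 0.7, 0.8)
print(choose_exploration(known, [diagonal], "A", "C", reps_left=10))
```

In the toy run, a one-time detour cost of 0.7 is outweighed by an expected saving of 0.4 per repetition (probability 0.8 times savings 0.5) over ten remaining repetitions, so the detour is chosen; with reps_left=1 the same call returns None and the agent exploits the best known path instead, which is the cost-benefit tradeoff the abstract describes.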
