Learning Evaluation Functions for Large Acyclic Domains

Some of the most successful recent applications of reinforcement learning have used neural networks and the TD(λ) algorithm to learn evaluation functions. In this paper, we examine the intuition that TD(λ) operates by approximating asynchronous value iteration. We note that on the important subclass of acyclic tasks, value iteration is inefficient compared with another graph algorithm, DAG-SP, which assigns values to states by working strictly backwards from the goal. We then present ROUT, an algorithm analogous to DAG-SP that can be used in large stochastic state spaces requiring function approximation. We close by comparing the behavior of ROUT and TD on a simple example domain and on two domains with much larger state spaces.
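To make the contrast with value iteration concrete, the following is a minimal sketch of the DAG-SP idea on a small deterministic, acyclic task: states are processed in reverse topological order, so a single backward sweep from the goal assigns each state its exact cost-to-go, with no repeated sweeps to convergence. The graph representation, the goal label, and the name dag_sp_values are illustrative assumptions, not the paper's notation, and this sketch does not cover the stochastic, function-approximation setting that ROUT addresses.

    # Sketch of DAG-SP-style backward value assignment (illustrative, not the
    # paper's ROUT algorithm). Assumes a deterministic acyclic graph given as
    # {state: [(successor, cost), ...]}.
    def dag_sp_values(graph, goal):
        # Post-order DFS: every successor appears in `order` before its
        # predecessors, so one pass over `order` suffices on a DAG.
        order, visited = [], set()

        def visit(s):
            if s in visited:
                return
            visited.add(s)
            for succ, _ in graph.get(s, []):
                visit(succ)
            order.append(s)

        for s in graph:
            visit(s)

        values = {goal: 0.0}
        for s in order:
            if s == goal:
                continue
            # Backed-up cost through each successor that can reach the goal.
            costs = [c + values[succ] for succ, c in graph.get(s, []) if succ in values]
            if costs:
                values[s] = min(costs)
        return values

    # Example (hypothetical 4-state task):
    graph = {"start": [("a", 1.0), ("b", 4.0)],
             "a": [("goal", 2.0)],
             "b": [("goal", 1.0)],
             "goal": []}
    print(dag_sp_values(graph, "goal"))
    # {'goal': 0.0, 'a': 2.0, 'b': 1.0, 'start': 3.0}

Each state's value is final the first time it is visited, which is the efficiency advantage over asynchronous value iteration that the abstract alludes to for acyclic tasks.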