Explanation-Based Learning and Reinforcement Learning: A Unified View

In speedup-learning problems, where full descriptions of the operators are known, both explanation-based learning (EBL) and reinforcement learning (RL) methods can be applied. This paper shows that both methods involve fundamentally the same process: propagating information backward from the goal toward the starting state. Most RL methods perform this propagation on a state-by-state basis, whereas EBL methods compute the weakest preconditions of operators and hence perform it on a region-by-region basis. Barto, Bradtke, and Singh (1995) have observed that many reinforcement learning algorithms can be viewed as asynchronous dynamic programming. Building on this observation, the paper shows how to develop dynamic programming versions of EBL, which we call region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL). The paper compares batch and online versions of EBRL with batch and online versions of point-based dynamic programming and with standard EBL. The results show that region-based dynamic programming combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strength of reinforcement learning algorithms (learning of optimal policies). Results are reported for chess endgames and for synthetic maze tasks.
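The contrast between state-by-state and region-by-region propagation can be illustrated with a minimal Python sketch. It is not the paper's algorithm and does not use its experimental domains: the grid size, the unit-step operators, the rectangular regions, and all function names are assumptions chosen for illustration. The dominance test (a new region is dropped only when a single existing rectangle contains it) is a simplification that happens to suffice for this toy problem. Because every step costs 1, a breadth-first backward sweep stands in for the dynamic programming backups.

```python
from collections import deque

W, H = 6, 4  # hypothetical grid size
# Hypothetical operators: unit moves (dx, dy), each costing one step.
OPERATORS = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}
# Goal region as an inclusive rectangle (x0, x1, y0, y1): the rightmost column.
GOAL = (W - 1, W - 1, 0, H - 1)


def point_based_dp():
    """Propagate cost-to-goal backward one state at a time."""
    x0, x1, y0, y1 = GOAL
    cost = {(x, y): 0 for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}
    queue = deque(cost)
    backups = 0
    while queue:
        x, y = queue.popleft()
        for dx, dy in OPERATORS.values():
            backups += 1
            px, py = x - dx, y - dy  # state from which the operator reaches (x, y)
            if 0 <= px < W and 0 <= py < H and (px, py) not in cost:
                cost[(px, py)] = cost[(x, y)] + 1
                queue.append((px, py))
    return cost, backups


def preimage(rect, op):
    """States that the operator maps into rect (a weakest precondition)."""
    x0, x1, y0, y1 = rect
    dx, dy = op
    x0, x1 = max(x0 - dx, 0), min(x1 - dx, W - 1)
    y0, y1 = max(y0 - dy, 0), min(y1 - dy, H - 1)
    return (x0, x1, y0, y1) if x0 <= x1 and y0 <= y1 else None


def contains(outer, inner):
    """True if rectangle inner lies entirely within rectangle outer."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def region_based_dp():
    """Propagate cost-to-goal backward one rectangle at a time."""
    regions = [(GOAL, 0)]
    queue = deque(regions)
    backups = 0
    while queue:
        rect, c = queue.popleft()
        for op in OPERATORS.values():
            backups += 1
            pre = preimage(rect, op)
            # Regions are generated in order of non-decreasing cost, so any
            # existing rectangle that contains `pre` already covers it at
            # least as cheaply (simplified dominance test).
            if pre is None or any(contains(r, pre) for r, _ in regions):
                continue
            regions.append((pre, c + 1))
            queue.append((pre, c + 1))
    return regions, backups


def region_cost(regions, state):
    """Cost of a state: the cheapest region that contains it."""
    x, y = state
    return min(c for (x0, x1, y0, y1), c in regions
               if x0 <= x <= x1 and y0 <= y <= y1)


if __name__ == "__main__":
    point_cost, point_backups = point_based_dp()
    regions, region_backups = region_based_dp()
    print("point-based:", len(point_cost), "entries,", point_backups, "backups")
    print("region-based:", len(regions), "regions,", region_backups, "backups")
```

In this toy setting the region-based sweep stores one rectangle per column rather than one value per cell, while assigning every state the same cost to reach the goal as the point-based sweep.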

[1] Bill Broyles. Notes, 1907, The Classical Review.

[2] Edsger W. Dijkstra et al. A note on two problems in connexion with graphs, 1959, Numerische Mathematik.

[3] R. Bellman. Dynamic programming, 1957, Science.

[4] Jon Doyle et al. A Truth Maintenance System, 1979, Artif. Intell.

[5] H. Edelsbrunner. A new approach to rectangle intersections part I, 1983.

[6] J. Ross Quinlan et al. Learning Efficient Classification Procedures and Their Application to Chess End Games, 1983.

[7] Russell H. Taylor et al. Automatic Synthesis of Fine-Motion Strategies for Robots, 1984.

[8] Michael R. Genesereth et al. Logic programming, 1985, CACM.

[9] Ken Thompson et al. Retrograde Analysis of Certain Endgames, 1986, J. Int. Comput. Games Assoc.

[10] Michael A. Erdmann et al. Using Backprojections for Fine Motion Planning with Uncertainty, 1986.

[11] Eric Horvitz et al. Reasoning about beliefs and actions under computational resource constraints, 1987, Int. J. Approx. Reason.

[12] Jaime G. Carbonell et al. Learning effective search control knowledge: an explanation-based approach, 1988.

[13] Christopher G. Atkeson et al. Using Local Models to Control Movement, 1989, NIPS.

[14] C. Watkins. Learning from delayed rewards, 1989.

[15] D. Bertsekas et al. Adaptive aggregation methods for infinite horizon dynamic programming, 1989.

[16] Marshall Bern. Hidden Surface Removal for Rectangles, 1990, J. Comput. Syst. Sci.

[17] Paul E. Utgoff et al. Explaining Temporal Differences to Create Useful Concepts for Evaluating States, 1990, AAAI.

[18] Devika Subramanian et al. The Utility of EBL in Recursive Domain Theories, 1990, AAAI.

[19] Claude Sammut et al. Is Learning Rate a Good Performance Criterion for Learning?, 1990, ML.

[20] Steven Minton et al. Quantitative Results Concerning the Utility of Explanation-based Learning, 1988, Artif. Intell.

[21] Ronald L. Rivest et al. Introduction to Algorithms, 1990.

[22] Leslie Pack Kaelbling et al. Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons, 1991, IJCAI.

[23] Charles L. Forgy et al. Rete: a fast algorithm for the many pattern/many object pattern match problem, 1991.

[24] Stuart J. Russell et al. Do the right thing - studies in limited rationality, 1991.

[25] Stuart J. Russell et al. Principles of Metareasoning, 1989, Artif. Intell.

[26] Alan D. Christiansen. Learning to Predict in Uncertain Continuous Tasks, 1992, ML.

[27] N. Flann. Correct abstraction in counter-planning: a knowledge compilation approach, 1992.

[28] Gerald Tesauro et al. Practical Issues in Temporal Difference Learning, 1992, Mach. Learn.

[29] Raymond J. Mooney et al. Combining FOIL and EBG to Speed-up Logic Programs, 1993, IJCAI.

[30] Andrew W. Moore et al. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces, 2004, Machine Learning.

[31] Martin L. Puterman et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[32] Thomas G. Dietterich et al. Explanation-Based Learning and Reinforcement Learning: A Unified View, 1995, Machine-mediated learning.

[33] Wei Zhang et al. A Reinforcement Learning Approach to job-shop Scheduling, 1995, IJCAI.

[34] Andrew G. Barto et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[35] Andrew W. Moore et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[36] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[37] Yoram Singer et al. A simple, fast, and effective rule learner, 1999, AAAI 1999.

[38] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.

[39] Carlos Guestrin et al. Generalizing plans to new environments in relational MDPs, 2003, IJCAI 2003.

[40] Peter Dayan et al. Q-learning, 1992, Machine Learning.

[41] Tom M. Mitchell et al. Explanation-Based Generalization: A Unifying View, 1986, Machine Learning.

[42] Andrew W. Moore et al. The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces, 1993, Machine Learning.

[43] A. Newell et al. Chunking in Soar: The anatomy of a general learning mechanism, 1985, Machine Learning.

[44] Gerald Tesauro et al. Practical issues in temporal difference learning, 1992, Machine Learning.

[45] Peter Dayan et al. Technical Note: Q-Learning, 2004, Machine Learning.

[46] Long Ji Lin et al. Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.

[47] Pat Langley et al. An architecture for persistent reactive behavior, 2004, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004.

[48] Andrew W. Moore et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time, 1993, Machine Learning.

[49] Robert Givan et al. Relational Reinforcement Learning: An Overview, 2004, ICML 2004.

[50] Allen Newell et al. The problem of expensive chunks and its solution by restricting expressiveness, 1993, Machine Learning.

[51] Richard S. Sutton et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[52] De et al. Relational Reinforcement Learning, 2001, Encyclopedia of Machine Learning and Data Mining.