Learning and planning in structured worlds

This thesis is concerned with the problem of how to make decisions in an uncertain world. We model uncertainty using Markov decision problems and develop a number of decision-making algorithms, both for the planning problem, in which the model is known in advance, and for the reinforcement learning problem, in which the decision-making agent does not know the model and must learn to make good decisions by trial and error.

The basis for much of this work is the use of structured representations of problems. If a problem is represented in a structured way, we can compute or learn plans that exploit this structure for computational gains, because the structure allows us to perform abstraction. Rather than reasoning individually about each situation in which a decision must be made, abstraction lets us group situations together and reason about a whole set of them in a single step. Our approach to abstraction has the additional advantage that the level of abstraction can be changed dynamically: a group of situations is split in two if its members must be reasoned about separately to find an acceptable plan, and two groups are merged if they no longer need to be distinguished. We present two planning algorithms and one learning algorithm that use this approach.

The second idea we present in this thesis is a novel approach to the exploration problem in reinforcement learning: how to select actions so as to perform well both now and in the future. We can select the action that currently appears best, but this may prevent us from discovering that another action is better; or we can take an exploratory action, at the risk of performing poorly now as a result. Our Bayesian approach makes this tradeoff explicit by representing our uncertainty about the values of states and using this measure of uncertainty to estimate the value of the information we could gain by performing each action. We present both model-free and model-based reinforcement learning algorithms that make use of this exploration technique.

Finally, we show how these ideas fit together to produce a reinforcement learning algorithm that uses structure to represent both the problem being solved and the plan it learns, and that uses our Bayesian approach to exploration to select the actions it performs while learning.
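To make the abstraction idea concrete, the following is a minimal sketch, in Python, of value iteration over groups of states: a group is split when its members' Bellman backups disagree by more than a tolerance, and the symmetric merge of groups whose values reconverge is noted but omitted. The toy transition table `P`, reward function `R`, and the `SPLIT_TOL` threshold are illustrative assumptions; the algorithms in the thesis operate on structured, factored problem descriptions rather than an enumerated state space.

```python
# Minimal sketch of abstraction by state aggregation (not the thesis's
# structured algorithms): value iteration over "blocks" of states, splitting
# a block whenever its members' Bellman backups disagree by more than a
# tolerance.  The MDP below and SPLIT_TOL are illustrative assumptions.
from collections import defaultdict

# Toy MDP: P[s][a] is a list of (next_state, probability); R[s] is the reward.
P = {
    's0': {'a': [('s1', 1.0)], 'b': [('s2', 1.0)]},
    's1': {'a': [('s1', 1.0)], 'b': [('s2', 1.0)]},
    's2': {'a': [('s2', 1.0)], 'b': [('s2', 1.0)]},
}
R = {'s0': 0.0, 's1': 0.0, 's2': 1.0}
GAMMA, SPLIT_TOL = 0.9, 1e-3

blocks = [set(P)]            # start fully aggregated: one block of all states
V = defaultdict(float)       # current value estimates

for _ in range(100):
    new_blocks = []
    for block in blocks:
        # One Bellman backup for every state in the block.
        backups = {
            s: R[s] + GAMMA * max(
                sum(p * V[s2] for s2, p in P[s][a]) for a in P[s])
            for s in block
        }
        for s in block:
            V[s] = backups[s]
        lo, hi = min(backups.values()), max(backups.values())
        if hi - lo <= SPLIT_TOL:
            # The states still agree, so keep reasoning about them as a group.
            new_blocks.append(block)
        else:
            # The states need to be distinguished: split the block in two.
            # (Merging blocks whose values reconverge is omitted for brevity.)
            mid = (lo + hi) / 2.0
            low = {s for s in block if backups[s] <= mid}
            new_blocks.extend(b for b in (low, block - low) if b)
    blocks = new_blocks

print(blocks)     # final grouping of states
print(dict(V))    # converged value estimates
```

In this toy problem `s0` and `s1` remain grouped throughout, because their backed-up values never diverge, while `s2` is split off in the first iteration.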
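The exploration idea can be sketched in the same spirit. The fragment below ranks actions by mean estimated value plus an estimate of the value of perfect information about that action, under the simplifying assumption that each action's value estimate is an independent normal posterior; the action names and numbers in `example` are hypothetical, and the posteriors used in the thesis are richer than a plain normal.

```python
# Minimal sketch of value-of-information action selection, assuming each
# action's value estimate is an independent normal posterior (mean, std).
# The action names and numbers in `example` are hypothetical.
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_gain_above(mu, sigma, c):
    """E[max(0, X - c)] for X ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, mu - c)
    z = (mu - c) / sigma
    return (mu - c) * normal_cdf(z) + sigma * normal_pdf(z)

def expected_shortfall_below(mu, sigma, c):
    """E[max(0, c - X)] for X ~ N(mu, sigma^2)."""
    return expected_gain_above(-mu, sigma, -c)

def select_action(posteriors):
    """posteriors: {action: (mean, std)} for the current state.
    Returns the action maximising mean value plus the value of perfect
    information (VPI) about that action."""
    ranked = sorted(posteriors, key=lambda a: posteriors[a][0], reverse=True)
    best, second = ranked[0], ranked[1]
    q1, q2 = posteriors[best][0], posteriors[second][0]

    scores = {}
    for a, (mu, sigma) in posteriors.items():
        if a == best:
            # Information about the best action matters only if its true
            # value turns out to be below the second-best estimate q2.
            vpi = expected_shortfall_below(mu, sigma, q2)
        else:
            # Information about any other action matters only if its true
            # value turns out to exceed the current best estimate q1.
            vpi = expected_gain_above(mu, sigma, q1)
        scores[a] = mu + vpi
    return max(scores, key=scores.get)

# Hypothetical posteriors over action values in a single state.
example = {'left': (1.0, 0.1), 'right': (0.9, 0.8), 'wait': (0.2, 0.05)}
print(select_action(example))   # prefers the uncertain action 'right'
```

With these numbers the highly uncertain action `right` is preferred over the action with the highest mean, which is exactly the exploration tradeoff the Bayesian approach makes explicit.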
