Algorithms for Sequential Decision Making

Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" means maximizing a long-run measure of reward, and "I" is an automated planning or learning system (an agent). In particular, I collect basic results concerning methods for finding optimal (or near-optimal) behavior in several different kinds of model environments: Markov decision processes (MDPs), in which the agent always knows its state; partially observable Markov decision processes (POMDPs), in which the agent must piece together its state on the basis of the observations it makes; and Markov games, in which the agent is in direct competition with an opponent. The thesis is written from a computer-science perspective: many mathematical details are omitted, and the emphasis is on descriptions of algorithms and the complexity of the underlying problems. New results include an improved algorithm for solving POMDPs exactly over finite horizons, a method for learning minimax-optimal policies for Markov games, a pseudopolynomial bound for policy iteration, and a complete complexity theory for finding zero-reward POMDP policies.
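
To make the fully observable setting concrete, the sketch below runs value iteration on a tiny discounted MDP with a finite state set and finite action set, then extracts a greedy policy. This is a minimal illustration, not an algorithm taken from the thesis; the state names, transition probabilities, rewards, and discount factor are invented for the example.

```python
# Minimal value-iteration sketch for a small, fully observable MDP.
# P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the
# expected immediate reward for taking action a in state s.
# All numbers below are illustrative assumptions.
P = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
gamma = 0.9      # discount factor defining the long-run reward measure
epsilon = 1e-6   # stop when the largest Bellman residual falls below this

V = {s: 0.0 for s in P}  # value-function estimate, initialized to zero
while True:
    delta = 0.0
    for s in P:
        # Bellman backup: best one-step lookahead over the available actions.
        best = max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < epsilon:
        break

# Greedy policy with respect to the converged value function.
policy = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
    for s in P
}
print(V, policy)
```

The POMDP and Markov-game settings treated in the thesis generalize this backup: in a POMDP the maximization is carried out over a belief state inferred from observations, and in a Markov game the inner maximization becomes a minimax computation against an opponent.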
