Approximate solutions to Markov decision processes

One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; the robot must also sense and correct for deviations from its intended path. For any machine learner to act reasonably in an uncertain environment, it must solve problems like this one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal, so to scale our learners up to real-world problems we usually must settle for approximate solutions.

One representation for a learner's environment and goals is a Markov decision process, or MDP. MDPs allow us to represent actions that have probabilistic outcomes and to plan for complicated, temporally extended goals. An MDP consists of a set of states that the environment can be in, together with rules for how the environment can change state and for what the learner is supposed to do. One way to approach a large MDP is to compute an approximation to its optimal state evaluation function, the function that tells us how much reward the learner can expect to achieve if the world is in a particular state. If the approximation is good enough, we can use a shallow search to find a good action from most states.

Researchers have tried many different ways to approximate evaluation functions. This thesis aims for a middle ground between algorithms that don't scale well because they use an impoverished representation for the evaluation function and algorithms that we can't analyze because they use too complicated a representation.
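
To make the last two ideas concrete, the following is a minimal Python sketch of a state evaluation function together with a shallow, one-step search that acts greedily with respect to it. The toy MDP (its states, actions, rewards, and transition probabilities), the discount factor, and the helper names value_iteration and greedy_action are assumptions made for illustration and are not taken from the thesis; the example MDP is small enough to evaluate exactly, whereas in a large MDP the table V would be replaced by a learned approximation and the same shallow search would be applied on top of it.

# A minimal sketch (not from the thesis) of a state evaluation function V
# and a shallow, one-step search that acts greedily with respect to it.
# The toy MDP, discount factor, and function names are illustrative assumptions.

GAMMA = 0.9  # discount factor (assumed)

# Transition model: state -> action -> list of (probability, next state, reward).
P = {
    "far":  {"move": [(0.8, "near", 0.0), (0.2, "far", 0.0)],
             "wait": [(1.0, "far", 0.0)]},
    "near": {"move": [(0.9, "goal", 1.0), (0.1, "far", 0.0)],
             "wait": [(1.0, "near", 0.0)]},
    "goal": {"move": [(1.0, "goal", 0.0)],
             "wait": [(1.0, "goal", 0.0)]},
}

def value_iteration(P, gamma, sweeps=100):
    # Compute the optimal state evaluation function V*(s). This toy MDP is
    # small enough to do so exactly; in a large MDP, V would instead be an
    # approximation (for example, a parametric function of state features).
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        for s in P:
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in P[s].values())
    return V

def greedy_action(P, V, s, gamma):
    # Shallow search: look one step ahead and choose the action with the best
    # expected reward plus discounted value of the resulting state.
    return max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))

V = value_iteration(P, GAMMA)
for s in P:
    print(s, round(V[s], 3), "->", greedy_action(P, V, s, GAMMA))

If V is only approximately correct, the same one-step greedy lookahead still applies; how good the resulting actions are depends on how accurate the approximation is, which is exactly why the quality of the evaluation-function representation matters.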
