Solving Factored MDPs with Hybrid State and Action Variables

Efficient representations and solutions for large decision problems with continuous and discrete variables are among the most important challenges faced by the designers of automated decision support systems. In this paper, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a new hybrid approximate linear programming (HALP) framework that permits their efficient solution. The central idea of HALP is to approximate the optimal value function by a linear combination of basis functions and to optimize its weights by linear programming. We analyze both theoretical and computational aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems.
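To make the approximate linear programming idea concrete, the sketch below solves a tiny, fully discrete MDP with scipy's linprog: the value function is restricted to a linear combination of basis functions, and its weights are found by a single LP. All numbers, names, and basis functions are illustrative assumptions, not the paper's construction, and the hybrid (continuous plus discrete) machinery of HALP is deliberately omitted.

```python
# A minimal sketch of the approximate linear programming (ALP) idea behind HALP,
# restricted to a tiny fully discrete MDP. The MDP, basis functions, and
# state-relevance weights below are made-up assumptions for illustration.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s'] = P(s' | s, a)
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

# Basis functions phi_i(s): a constant feature plus one indicator per state.
phi = np.hstack([np.ones((n_states, 1)), np.eye(n_states)])       # phi[s, i]
n_basis = phi.shape[1]

# State-relevance weights alpha(s) used in the LP objective.
alpha = np.full(n_states, 1.0 / n_states)

# Objective: minimize sum_s alpha(s) * sum_i w_i * phi_i(s) over the weights w.
c = alpha @ phi

# One constraint per state-action pair:
#   sum_i w_i * (phi_i(s) - gamma * E[phi_i(s') | s, a]) >= R(s, a),
# rewritten as A_ub @ w <= b_ub, the form expected by scipy's linprog.
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        expected_phi = P[a, s] @ phi            # E[phi(s') | s, a]
        A_ub.append(-(phi[s] - gamma * expected_phi))
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_basis, method="highs")
w = res.x                                       # basis-function weights
print("approximate value function:", phi @ w)
```

In the hybrid setting studied in the paper, the sums over next states in these constraints become mixed sums and integrals over continuous variables, and the constraint set becomes infinite; keeping both tractable is the point of the HALP framework, which this discrete toy example sidesteps.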
