The Linear Programming Approach to Approximate Dynamic Programming

The curse of dimensionality gives rise to prohibitive computational requirements that render the exact solution of large-scale stochastic control problems infeasible. We study an efficient method based on linear programming for approximating solutions to such problems. The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop error bounds that offer performance guarantees and also guide the selection of both the basis functions and the "state-relevance weights" that influence the quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology.
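
As a rough sketch of what "fitting" by linear programming can look like, the program below is written for a discounted-cost problem. The notation is an assumption introduced here for illustration and does not appear in the abstract: a state space S, admissible actions A_x, per-stage costs g_a(x), transition probabilities P_a(x,y), a discount factor \alpha in (0,1), a matrix \Phi whose K columns are the basis functions, a weight vector r, and state-relevance weights c.

```latex
% Approximate linear program (sketch; all symbols are assumed notation):
% choose the weights r so that \Phi r is as large as possible, in the
% c-weighted sense, while remaining a sub-solution of Bellman's equation.
\begin{aligned}
\max_{r \in \mathbb{R}^K} \quad & c^{\top} \Phi r \\
\text{subject to} \quad
  & g_a(x) + \alpha \sum_{y \in S} P_a(x,y)\,(\Phi r)(y) \;\ge\; (\Phi r)(x)
  \qquad \forall\, x \in S,\ a \in A_x .
\end{aligned}
```

Under these assumptions the program has only K variables, one per basis function, but one constraint per state-action pair. The weights c determine in which regions of the state space the fit is prioritized.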
