A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees

We introduce a new linear programming algorithm for the optimization of average-cost Markov decision processes (MDPs). The algorithm approximates the differential cost function of a perturbed MDP by a linear combination of basis functions. We establish a bound on the performance of the resulting policy that scales gracefully with the number of states, without imposing the strong Lyapunov condition required by its counterpart in de Farias and Van Roy [de Farias, D. P., B. Van Roy. 2003. The linear programming approach to approximate dynamic programming. Oper. Res. 51(6) 850–865]. We investigate the implications of this result in the context of a queueing control problem.
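As a concrete reference point, the sketch below implements the generic average-cost approximate linear program (ALP) from which this line of work starts: approximate the differential cost as h ≈ Φr and maximize the average-cost variable λ subject to λ + (Φr)(x) ≤ c(x, a) + (P_a Φr)(x) for every state-action pair. This is a minimal tabular sketch under assumed inputs (a cost array c, per-action transition matrices P, and a basis matrix Phi), not the paper's cost-shaping LP, which perturbs the MDP before solving.

```python
# Minimal sketch of the generic average-cost approximate LP (ALP).
# Assumptions (not from the paper): tabular data with cost c[x, a],
# row-stochastic transition matrices P[a], and a basis matrix Phi of
# shape (n_states, K); the paper's cost-shaping perturbation is omitted.
import numpy as np
from scipy.optimize import linprog

def average_cost_alp(c, P, Phi):
    """Solve: max lambda  s.t.  lambda + (Phi r)(x) <= c(x,a) + (P_a Phi r)(x)."""
    n_states, n_actions = c.shape
    K = Phi.shape[1]
    # Decision vector z = (lambda, r_1, ..., r_K); linprog minimizes,
    # so the objective is -lambda.
    obj = np.zeros(1 + K)
    obj[0] = -1.0
    # One inequality per state-action pair, rearranged as
    #   lambda + ((I - P_a) Phi r)(x) <= c(x, a).
    A_ub = np.zeros((n_states * n_actions, 1 + K))
    b_ub = np.zeros(n_states * n_actions)
    row = 0
    for a in range(n_actions):
        D = (np.eye(n_states) - P[a]) @ Phi
        for x in range(n_states):
            A_ub[row, 0] = 1.0
            A_ub[row, 1:] = D[x]
            b_ub[row] = c[x, a]
            row += 1
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (1 + K), method="highs")
    lam, r = res.x[0], res.x[1:]
    return lam, r  # lower bound on optimal average cost, basis weights
```

Any feasible pair (λ, Φr) is also feasible for the exact average-cost LP, so the returned λ lower-bounds the optimal average cost; a policy is then obtained by acting greedily with respect to h = Phi @ r.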

[1]  A. S. Manne. Linear Programming and Sequential Decisions, 1960.

[2]  F. d'Epenoux. A Probabilistic Production and Inventory Problem, 1963.

[3]  Moshe Ben-Horim et al. A Linear Programming Approach, 1977.

[4]  P. Schweitzer, A. Seidmann. Generalized polynomial approximations in Markovian decision processes, 1985.

[5]  N. Kartashov. Inequalities in Theorems of Ergodicity and Stability for Markov Chains with Common Phase Space. I, 1986.

[6]  Sean P. Meyn, Richard L. Tweedie. Markov Chains and Stochastic Stability. Communications and Control Engineering Series, 1993.

[7]  Michael A. Trick, Stanley E. Zin. A Linear Programming Approach to Solving Stochastic Dynamic Programming, 1993.

[8]  Justin A. Boyan, Andrew W. Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. NIPS, 1994.

[9]  Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[10]  Mance E. Harmon, Leemon C. Baird, A. Harry Klopf. Advantage Updating Applied to a Differential Game. NIPS, 1994.

[11]  Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming. ICML, 1995.

[12]  Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[13]  Dimitri P. Bertsekas, John N. Tsitsiklis. Neuro-Dynamic Programming, 1996.

[14]  Sean P. Meyn. The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Trans. Autom. Control, 1997.

[15]  Michael A. Trick, Stanley E. Zin. Spline Approximations to Value Functions: A Linear Programming Approach, 1997.

[16]  J. R. Morrison, P. R. Kumar. New Linear Program Performance Bounds for Queueing Networks, 1999.

[17]  Geoffrey J. Gordon. Approximate solutions to Markov decision processes, 1999.

[18]  Rong-Rong Chen, Sean P. Meyn. Value iteration and optimization of multiclass queueing networks. Queueing Syst. Theory Appl., 1999.

[19]  Dale Schuurmans, Relu Patrascu. Direct value-approximation for factored MDPs. NIPS, 2001.

[20]  Vivek S. Borkar. Convex Analytic Methods in Markov Decision Processes, 2002.

[21]  Daniela Pucci de Farias, Benjamin Van Roy. Approximate Linear Programming for Average-Cost Dynamic Programming. NIPS, 2002.

[22]  Carlos Guestrin, Daphne Koller, Ronald Parr, Shobha Venkataraman. Efficient Solution Algorithms for Factored MDPs. J. Artif. Intell. Res., 2003.

[23]  Carlos Guestrin. Planning under uncertainty in complex structured environments. Ph.D. thesis, Stanford University, 2003.

[24]  Rémi Munos. Error Bounds for Approximate Policy Iteration. ICML, 2003.

[25]  Daniela Pucci de Farias, Benjamin Van Roy. The Linear Programming Approach to Approximate Dynamic Programming. Oper. Res. 51(6) 850–865, 2003.

[26]  Shane G. Henderson, Sean P. Meyn, Vladislav B. Tadić. Performance Evaluation and Policy Selection in Multiclass Networks. Discret. Event Dyn. Syst., 2003.

[27]  Milos Hauskrecht, Branislav Kveton. Linear Program Approximations for Factored Continuous-State Markov Decision Processes. NIPS, 2003.

[28]  Carlos Guestrin, Milos Hauskrecht, Branislav Kveton. Solving Factored MDPs with Continuous and Discrete Variables. UAI, 2004.

[29]  Michael H. Veatch. Approximate Dynamic Programming for Networks: Fluid Models and Constraint Reduction, 2004.

[30]  Daniela Pucci de Farias, Benjamin Van Roy. On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming. Math. Oper. Res., 2004.

[31]  Daniel Adelman. A Price-Directed Approach to Stochastic Inventory/Routing. Oper. Res., 2004.

[32]  John N. Tsitsiklis, Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 2004.

[33]  Michael Kearns, Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 1998.

[34]  Sean P. Meyn. Workload models for stochastic networks: value functions and performance evaluation. IEEE Trans. Autom. Control, 2005.

[35]  Daniela Pucci de Farias, Benjamin Van Roy. Tetris: A Study of Randomized Constraint Sampling, 2006.