Randomized Search Methods for Solving Markov Decision Processes and Global Optimization

Abstract: Markov decision process (MDP) models provide a unified framework for modeling and describing sequential decision-making problems that arise in engineering, economics, and computer science. However, the size of the resulting MDP model typically grows exponentially with the size of the underlying problem, making exact solution of the model intractable, especially for large problems. Moreover, for complex systems, some of the parameters of the MDP model often cannot be obtained directly, and only simulation samples are available. In the first part of this thesis, we develop two sampling/simulation-based numerical algorithms to address the computational difficulties arising from these settings. The two algorithms have different emphases: one focuses on MDPs with large state spaces but relatively small action spaces and emphasizes the efficient allocation of simulation samples to obtain good value function estimates, whereas the other targets problems with large action spaces but small state spaces and uses a population-based approach to avoid optimizing over the entire action space. We study the convergence properties of these algorithms and report computational results that illustrate their performance. The second part of this thesis develops a general framework, called Model Reference Adaptive Search (MRAS), for solving global optimization problems. The method iteratively updates a parameterized probability distribution over the solution space so that the sequence of candidate solutions generated from this distribution converges asymptotically to the global optimum. We provide a particular instantiation of the framework and establish its convergence properties in both continuous and discrete domains.
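To make the model-based search idea concrete, the following is a minimal sketch in Python of the general scheme the abstract describes: sample candidate solutions from a parameterized distribution, score them, and refit the distribution toward the better samples. It is an illustration only; the actual MRAS framework uses a sequence of reference distributions and a specific weighting/parameter-update rule, and the function names, parameters, and test objective below are assumptions made for the example.

```python
import numpy as np

def model_based_search(objective, dim, iters=200, pop=100, elite_frac=0.1, seed=0):
    """Sketch of a model-based search: iteratively sample from a Gaussian
    sampling distribution, score the samples, and refit the Gaussian to the
    elite (best-scoring) candidates. Not the exact MRAS update rule."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.full(dim, 5.0)   # initial sampling distribution
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))      # candidate solutions
        scores = np.apply_along_axis(objective, 1, samples)   # objective to maximize
        elite = samples[np.argsort(scores)[-n_elite:]]        # keep best candidates
        mean = elite.mean(axis=0)                             # update distribution
        std = elite.std(axis=0) + 1e-6                        # small floor avoids collapse
    return mean

if __name__ == "__main__":
    # Hypothetical multimodal test objective with its global maximum at the origin.
    f = lambda x: -np.sum(x**2) + np.sum(np.cos(2 * np.pi * x))
    print(model_based_search(f, dim=5))
```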
