An Asymptotically Efficient Simulation-Based Algorithm for Finite Horizon Stochastic Dynamic Programming

We present a simulation-based algorithm called "Simulated Annealing Multiplicative Weights" (SAMW) for solving large finite-horizon stochastic dynamic programming problems. At each iteration, the algorithm updates a probability distribution over candidate policies via a simple multiplicative weight rule; with proper annealing of a control parameter, the generated sequence of distributions converges to a distribution concentrated only on the best policies. The algorithm is "asymptotically efficient" in the sense that, for the goal of estimating the value of an optimal policy, it yields a provably convergent finite-time upper bound for the sample mean.
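To make the update rule concrete, the sketch below implements a generic multiplicative-weights scheme over a finite set of candidate policies. It is only an illustration of the idea described in the abstract, not the paper's exact SAMW algorithm: the simulate_policy interface, the rescaling of simulated rewards to [0, 1], and the 1 + 1/sqrt(t) annealing schedule for the control parameter beta_t are assumptions made for this example.

```python
import numpy as np

def samw_sketch(simulate_policy, num_policies, num_iters, seed=0):
    """Illustrative multiplicative-weights sketch (not the paper's exact SAMW).

    simulate_policy(i, rng) is assumed to return a simulated total reward
    in [0, 1] for candidate policy i.
    """
    rng = np.random.default_rng(seed)
    phi = np.full(num_policies, 1.0 / num_policies)  # uniform initial distribution

    for t in range(1, num_iters + 1):
        # One simulated value estimate per candidate policy at this iteration.
        q_hat = np.array([simulate_policy(i, rng) for i in range(num_policies)])

        # Annealed control parameter: beta_t > 1 with beta_t -> 1 as t grows
        # (this particular schedule is an illustrative assumption).
        beta_t = 1.0 + 1.0 / np.sqrt(t)

        # Multiplicative weight rule: boost policies with higher simulated value,
        # then renormalize to keep phi a probability distribution.
        phi = phi * np.power(beta_t, q_hat)
        phi /= phi.sum()

        # Weighted sample mean of the simulated values under the current distribution.
        weighted_mean = float(np.dot(phi, q_hat))

    return phi, weighted_mean

# Toy usage: three "policies" with different mean rewards; the distribution
# should concentrate on the highest-mean policy as iterations accumulate.
means = [0.3, 0.5, 0.8]
dist, est = samw_sketch(
    lambda i, rng: float(np.clip(rng.normal(means[i], 0.1), 0.0, 1.0)),
    num_policies=3, num_iters=2000)
print(dist, est)
```

In this toy run, mass shifts toward the best candidate because its weight is multiplied by a larger factor on average at every iteration, which mirrors the concentration behavior the abstract attributes to proper annealing of the control parameter.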
