CONVERGENCE OF SIMULATION-BASED POLICY ITERATION

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI uses simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions ensuring that SBPI almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations and, using our general results, derive simple conditions on the simulation run lengths that guarantee almost-sure convergence of the algorithm.
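
To make the structure of an SBPI iteration concrete, the following Python sketch shows one possible realization for a small finite MDP: policy evaluation is replaced by simulation estimates of the gain and bias, and policy improvement is greedy with respect to the estimated bias. The MDP data, the reference-state (regenerative-style) bias estimator, and the increasing run-length schedule are illustrative assumptions, not the specific estimators or run-length conditions analyzed in the paper.

    # Minimal sketch of simulation-based policy iteration (SBPI) for a
    # recurrent, finite average-reward MDP. All problem data below are
    # hypothetical and chosen only for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical MDP: P[a][s] is the next-state distribution when action a
    # is taken in state s; r[s, a] is the one-step reward.
    n_states, n_actions = 4, 2
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    def simulate_step(s, a):
        """Sample a next state of the chain induced by taking action a in state s."""
        return rng.choice(n_states, p=P[a][s])

    def estimate_gain_and_bias(policy, run_length, ref_state=0):
        """Estimate the gain (average reward) and a bias vector h for a fixed policy.

        Gain: sample average of rewards along one long trajectory.
        Bias: for each state, average the (reward - gain) accumulated until the
        first visit to a fixed reference state -- one of several possible
        simulation estimators for the average evaluation equations.
        """
        # Gain estimate from a single long run.
        s, total = 0, 0.0
        for _ in range(run_length):
            a = policy[s]
            total += r[s, a]
            s = simulate_step(s, a)
        gain = total / run_length

        # Bias estimate relative to the reference state.
        h = np.zeros(n_states)
        for s0 in range(n_states):
            acc, reps = 0.0, 0
            for _ in range(run_length // n_states):
                s, partial = s0, 0.0
                while True:
                    a = policy[s]
                    partial += r[s, a] - gain
                    s = simulate_step(s, a)
                    if s == ref_state:
                        break
                acc += partial
                reps += 1
            h[s0] = acc / max(reps, 1)
        return gain, h

    def improve(h):
        """Greedy policy improvement against the estimated bias."""
        q = r + np.array([[P[a][s] @ h for a in range(n_actions)]
                          for s in range(n_states)])
        return q.argmax(axis=1)

    policy = np.zeros(n_states, dtype=int)
    for n in range(1, 11):
        run_length = 100 * n  # run lengths grow with the iteration index
                              # (this particular schedule is illustrative)
        gain, h = estimate_gain_and_bias(policy, run_length)
        policy = improve(h)
        print(f"iteration {n}: gain ~ {gain:.3f}, policy = {policy}")

The structural point the sketch is meant to convey is that the improvement step uses a simulated estimate of the bias rather than an exact solution of the average evaluation equations, and that the simulation run length is allowed to increase across iterations, which is where conditions of the kind discussed in the abstract enter.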
