A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies

We consider stochastic optimal control models with Borel spaces and universally measurable policies. For such models, standard policy iteration is known to encounter serious measurability difficulties and cannot be carried out in general. We present a mixed value and policy iteration method that circumvents this difficulty. The method allows stationary policies to be used in computing the optimal cost function, in a manner that resembles policy iteration. It can also be used to address similar difficulties of policy iteration in the context of upper and lower semicontinuous models. We analyze the convergence of the method in infinite-horizon total cost problems, for the discounted case where the one-stage costs are bounded and for the undiscounted case where the one-stage costs are nonpositive or nonnegative. For undiscounted total cost problems with nonnegative one-stage costs, we also give a new convergence theorem for value iteration, which shows that value iteration converges whenever it is initialized with a function that lies above the optimal cost function and is bounded by a multiple of the optimal cost function. This condition resembles Whittle's bridging condition and is partly motivated by it. The theorem is also partly motivated by a result of Maitra and Sudderth showing that value iteration, when initialized with the constant function zero, may require a transfinite number of iterations to converge. We use the new convergence theorem for value iteration to establish the convergence of our mixed value and policy iteration method in the nonnegative cost case.
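The method of the paper is formulated for Borel-space models with universally measurable policies, where the measurability machinery is the heart of the matter and cannot be reproduced in a few lines of code. Purely to illustrate the flavor of the iteration described above, namely value-iteration-style updates interleaved with evaluation of a greedy stationary policy, the following Python sketch runs a generic optimistic (modified) policy iteration on a small, made-up finite-state discounted MDP. The model data, the discount factor, and the number of evaluation sweeps are illustrative assumptions only; the finite discounted setting sidesteps both the measurability issues and the initialization condition of the paper (a starting function J with J* <= J <= c J* for some scalar c).

    import numpy as np

    # Toy finite-state, finite-action discounted MDP (illustrative data only).
    n_states, n_actions = 3, 2
    alpha = 0.9  # discount factor

    # cost[s, a]: one-stage cost of using action a at state s.
    cost = np.array([[1.0, 2.0],
                     [0.5, 1.5],
                     [2.0, 0.2]])

    # P[a, s, t]: probability of moving from state s to state t under action a.
    P = np.array([[[0.8, 0.2, 0.0],
                   [0.1, 0.6, 0.3],
                   [0.0, 0.3, 0.7]],
                  [[0.5, 0.5, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.2, 0.0, 0.8]]])

    def bellman_q(J):
        """Q-factors of the Bellman operator: Q[s, a] = c(s, a) + alpha * E[J(next state)]."""
        return cost + alpha * np.einsum('ast,t->sa', P, J)

    def greedy_policy(J):
        """A stationary policy that is greedy with respect to J."""
        return bellman_q(J).argmin(axis=1)

    def evaluate_policy(J, mu, sweeps):
        """Apply the DP operator of the stationary policy mu to J a fixed number of times."""
        idx = np.arange(n_states)
        for _ in range(sweeps):
            J = cost[idx, mu] + alpha * P[mu, idx] @ J
        return J

    # Mixed iteration: extract a greedy stationary policy, then perform a few
    # value-iteration-like sweeps with that policy held fixed.
    J = np.zeros(n_states)
    for _ in range(100):
        mu = greedy_policy(J)
        J = evaluate_policy(J, mu, sweeps=3)

    print("approximate optimal cost function:", J)
    print("greedy policy:", greedy_policy(J))

In the discounted finite case each step is a contraction, so the loop converges to the optimal cost function regardless of the initial J; the paper's nonnegative-cost analysis concerns precisely the settings where such contraction arguments are unavailable and the initialization condition above matters.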

[1] D. Blackwell, et al., Non-Existence of Everywhere Proper Conditional Distributions, 1963.

[2] W. Rudin, Principles of mathematical analysis, 1964.

[3] D. Blackwell, Memoryless Strategies in Finite-Stage Dynamic Programming, 1964.

[4] D. Blackwell, Discounted Dynamic Programming, 1965.

[5] Onésimo Hernández-Lerma, et al., Controlled Markov Processes, 1965.

[6] A. F. Veinott, On Finding Optimal Policies in Discrete Dynamic Programming with No Discounting, 1966.

[7] K. Parthasarathy, Probability Measures in a Metric Space, 1967.

[8] David Blackwell, et al., Positive dynamic programming, 1967.

[9] D. Blackwell, A Borel Set Not Containing a Graph, 1968.

[10] A. F. Veinott, Discrete Dynamic Programming with Sensitive Discount Optimality Criteria, 1969.

[11] B. L. Miller, et al., Discrete Dynamic Programming with a Small Interest Rate, 1969.

[12] K. Hinderer, Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, 1970.

[13] N. Furukawa, Markovian Decision Processes with Compact Action Spaces, 1972.

[14] D. Bertsekas, Infinite time reachability of state-space regions by using feedback control, 1972.

[15] D. Blackwell, et al., The Optimal Reward Operator in Dynamic Programming, 1974.

[16] D. Freedman, The Optimal Reward Operator in Special Classes of Dynamic Programming Problems, 1974.

[17] Evan L. Porteus, On the Optimality of Structured Policies in Countable Stage Decision Processes, 1975.

[18] Manfred Schäl, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal, 1975.

[19] J. Neveu, et al., Discrete Parameter Martingales, 1975.

[20] D. Bertsekas, Monotone Mappings with Application in Dynamic Programming, 1977.

[21] D. Bertsekas, et al., Alternative theoretical frameworks for finite horizon discrete-time stochastic optimal control, 1977, 1977 IEEE Conference on Decision and Control including the 16th Symposium on Adaptive Processes and a Special Symposium on Fuzzy Set Theory and Applications.

[22] Evan L. Porteus, et al., On the Optimality of Structured Policies in Countable Stage Decision Processes. II: Positive and Negative Problems, 1977.

[23] S. Shreve, Probability measures and the C-sets of Selivanovskij, 1978.

[24] D. Blackwell, Borel-Programmable Functions, 1978.

[25] Dimitri P. Bertsekas, et al., Universally Measurable Policies in Dynamic Programming, 1979, Math. Oper. Res.

[26] P. Whittle, A simple condition for regularity in negative programming, 1979, Journal of Applied Probability.

[27] S. Shreve, Resolution of measurability problems in discrete-time stochastic control, 1979.

[28] P. Whittle, Stability and characterisation conditions in negative programming, 1980, Journal of Applied Probability.

[29] R. Hartley, A simple proof of Whittle's bridging condition in dynamic programming, 1980.

[30] S. Shreve, Borel-approachable functions, 1981.

[31] Rolf van Dawen, et al., Negative Dynamic Programming, 1984.

[32] William D. Sudderth, et al., The Optimal Reward Operator in Negative Dynamic Programming, 1992, Math. Oper. Res.

[33] Richard L. Tweedie, et al., Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.

[34] M. K. Ghosh, et al., Discrete-time controlled Markov processes with average cost criterion: a survey, 1993.

[35] Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[36] Ben J. A. Kröse, et al., Learning from delayed rewards, 1995, Robotics Auton. Syst.

[37] Dimitri P. Bertsekas, Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[38] John N. Tsitsiklis, et al., Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[39] Vivek S. Borkar, et al., Stochastic Approximation for Nonexpansive Maps: Application to Q-Learning Algorithms, 1997, SIAM J. Control Optim.

[40] W. Fleming, Book Review: Discrete-time Markov control processes: Basic optimality criteria, 1997.

[41] Andrew G. Barto, et al., Reinforcement learning, 1998.

[42] O. Hernández-Lerma, et al., Discrete-time Markov control processes, 1999.

[43] John N. Tsitsiklis, et al., Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, 1999, IEEE Trans. Autom. Control.

[44] E. Altman, Constrained Markov Decision Processes, 1999.

[45] O. Hernández-Lerma, et al., Further topics on discrete-time Markov control processes, 1999.

[46] Sean P. Meyn, et al., Value iteration and optimization of multiclass queueing networks, 1999, Queueing Syst. Theory Appl.

[47] Sean P. Meyn, et al., The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control Optim.

[48] R. M. Dudley, Real Analysis and Probability: Measurability: Borel Isomorphism and Analytic Sets, 2002.

[49] Eugene A. Feinberg, et al., Total Reward Criteria, 2002.

[50] Eugene A. Feinberg, et al., Handbook of Markov Decision Processes, 2002.

[51] John N. Tsitsiklis, Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.

[52] Sean P. Meyn, Control Techniques for Complex Networks: Workload, 2007.

[53] Dimitri P. Bertsekas, et al., Stochastic optimal control: the discrete time case, 2007.

[54] Shashi M. Srivastava, A Course on Borel Sets, 1998, Graduate Texts in Mathematics.

[55] Dimitri P. Bertsekas, et al., Distributed asynchronous policy iteration in dynamic programming, 2010, 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[56] Dimitri P. Bertsekas, et al., Q-learning and enhanced policy iteration in discounted dynamic programming, 2010, 49th IEEE Conference on Decision and Control (CDC).

[57] Eugene A. Feinberg, et al., Average Cost Markov Decision Processes with Weakly Continuous Transition Probabilities, 2012, Math. Oper. Res.

[58] Dimitri P. Bertsekas, et al., On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems, 2013, Math. Oper. Res.

[59] Dimitri P. Bertsekas, et al., Q-learning and policy iteration algorithms for stochastic shortest path problems, 2012, Annals of Operations Research.

[60] Dimitri P. Bertsekas, Abstract Dynamic Programming, 2013.

[61] Huizhen Yu, et al., On Convergence of Value Iteration for a Class of Total Cost Markov Decision Processes, 2014, SIAM J. Control Optim.

[62] Kjetil K. Haugen, Stochastic Dynamic Programming, 2016.

[63] Peter Stone, et al., Reinforcement learning, 2019, Scholarpedia.

[64] O. Gaans, Probability measures on metric spaces, 2022.