Simulation-Based Optimization Algorithms for Finite-Horizon Markov Decision Processes

We develop four simulation-based algorithms for finite-horizon Markov decision processes. Two of the algorithms are designed for finite state and compact action spaces, while the other two are for finite state and finite action spaces. Of the former two, one uses a linear parameterization of the policy, which reduces memory complexity. We briefly sketch the convergence analysis and present illustrative numerical experiments with all four algorithms on a problem of flow control in communication networks.
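To make the setting concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithms) of simulation-based optimization of a linearly parameterized policy for a finite-horizon admission-control problem, using a two-simulation SPSA gradient estimate. The queue dynamics, feature map, cost function, and gain sequences are all illustrative assumptions.

```python
import numpy as np

def features(s, num_states):
    # Illustrative linear features: a bias term and the normalized queue length.
    return np.array([1.0, s / (num_states - 1)])

def policy_action(theta, s, num_states):
    # Linearly parameterized admission probability, clipped to the compact
    # action space [0, 1]; memory grows with the feature dimension, not |S|*H.
    return float(np.clip(theta @ features(s, num_states), 0.0, 1.0))

def simulate_cost(theta, horizon, num_states, target, rng):
    # One finite-horizon trajectory; the per-stage cost penalizes deviation
    # of the queue length from a target level (an assumed cost structure).
    s, total = 0, 0.0
    for _ in range(horizon):
        a = policy_action(theta, s, num_states)
        arrival = rng.random() < a      # customer admitted with probability a
        service = rng.random() < 0.5    # fixed service probability (assumption)
        s = int(np.clip(s + arrival - service, 0, num_states - 1))
        total += abs(s - target)
    return total

def spsa_optimize(iters=200, horizon=20, num_states=10, target=4, seed=0):
    # Two-simulation SPSA: perturb theta along a random Rademacher direction,
    # estimate the gradient from two simulated trajectory costs, and descend.
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for k in range(iters):
        a_k = 0.1 / (k + 1) ** 0.602    # standard SPSA gain sequences
        c_k = 0.5 / (k + 1) ** 0.101
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        j_plus = simulate_cost(theta + c_k * delta, horizon, num_states, target, rng)
        j_minus = simulate_cost(theta - c_k * delta, horizon, num_states, target, rng)
        theta = theta - a_k * (j_plus - j_minus) / (2.0 * c_k) * delta
    return theta
```

Because the gradient estimate needs only two trajectory simulations per iteration regardless of the parameter dimension, this style of algorithm scales to policies with many features; the actual algorithms in the paper differ in their perturbation and timescale structure.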
