An online primal-dual method for discounted Markov decision processes

We consider the online solution of discounted Markov decision processes (MDPs). We focus on the black-box learning model, in which the transition probabilities and state transition costs are unknown; instead, a simulator is available that generates random state transitions under given actions. We propose a stochastic primal-dual algorithm for solving the linear programming formulation of the Bellman equation. The algorithm updates the primal and dual iterates using sampled state transitions and sampled costs generated by the simulator. We also provide a thresholding procedure that recovers the exact optimal policy from the dual iterates with high probability.
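
For reference, the linear programming formulation in question is the standard one for discounted cost minimization; the notation below (costs c(s,a), transition kernel P, discount factor gamma, positive state-weight vector alpha, dual variables mu) is generic and not taken from the paper:

\begin{aligned}
\text{(P)}\quad & \max_{v}\ \alpha^{\top} v
  && \text{s.t.}\ \ v(s) \le c(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\, v(s') \quad \forall (s,a),\\
\text{(D)}\quad & \min_{\mu \ge 0}\ \textstyle\sum_{s,a} \mu(s,a)\, c(s,a)
  && \text{s.t.}\ \ \textstyle\sum_{a} \mu(s',a) = \alpha(s') + \gamma \textstyle\sum_{s,a} P(s'\mid s,a)\, \mu(s,a) \quad \forall s'.
\end{aligned}

The dual variables mu(s,a) are (alpha-weighted) discounted state-action occupancy measures, and an optimal deterministic policy uses, in each state, only actions carrying positive dual mass; this is why a thresholding of the dual iterates can recover the optimal policy.

The sketch below illustrates the kind of simulator-driven stochastic primal-dual update described in the abstract, on a small tabular MDP. It is a minimal illustration under our own assumptions, not the paper's algorithm: the `simulate` interface, the uniform sampling of state-action pairs, the step sizes, the omission of projections onto bounded sets, and the final argmax rounding (standing in for the paper's thresholding procedure) are all illustrative choices.

```python
import numpy as np

def stochastic_primal_dual(simulate, n_states, n_actions, gamma,
                           n_iters=200_000, step_v=0.05, step_mu=0.05, seed=0):
    """Sketch of a stochastic primal-dual iteration for the LP saddle point above.

    simulate(s, a) -> (next_state, cost): one sampled transition from the black box.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(n_states, 1.0 / n_states)      # weights in the primal objective
    v = np.zeros(n_states)                         # primal iterate (value estimates)
    mu = np.full((n_states, n_actions), 1.0)       # dual iterate (occupancy-type weights)
    mu_avg = np.zeros_like(mu)                     # running average of the dual iterates

    for t in range(n_iters):
        # Sample a state-action pair uniformly and query the simulator once.
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s_next, cost = simulate(s, a)

        # Stochastic gradient of the Lagrangian
        #   L(v, mu) = alpha^T v + sum_{s,a} mu(s,a) * (c(s,a) + gamma*(P_a v)(s) - v(s))
        # built from the single sampled (s, a, s'); the factor n_states*n_actions
        # compensates for sampling one term of the (s, a) sum uniformly.
        scale = n_states * n_actions
        grad_v = alpha.copy()
        grad_v[s] -= scale * mu[s, a]
        grad_v[s_next] += scale * gamma * mu[s, a]

        # Sampled residual of the Bellman constraint for the visited (s, a).
        resid = cost + gamma * v[s_next] - v[s]

        step = 1.0 / np.sqrt(t + 1.0)
        v += step_v * step * grad_v                                  # primal ascent in v
        mu[s, a] = max(0.0, mu[s, a] - step_mu * step * resid)       # projected dual descent in mu

        mu_avg += (mu - mu_avg) / (t + 1)          # average the dual iterates

    # Rounding step: the paper thresholds the dual iterates to recover the exact
    # optimal policy with high probability; taking the per-state argmax of the
    # averaged dual weights is a simplified stand-in for that procedure.
    policy = mu_avg.argmax(axis=1)
    return v, mu_avg, policy
```

Given any simulator with the assumed `simulate(s, a) -> (next_state, cost)` signature, the returned `policy` array maps each state to the action with the largest averaged dual weight.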