Scalable Bilinear π Learning Using State and Action Features

Approximate linear programming (ALP) is one of the major algorithmic families for solving large-scale Markov decision processes (MDPs). In this work, we study a primal-dual formulation of the ALP and develop a scalable, model-free algorithm, called bilinear $\pi$ learning, for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features; its run-time complexity depends on the number of features, not on the size of the underlying MDP. Second, it operates in a fully online fashion without having to store any samples, and thus has a minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.
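
To make the bilinear parameterization concrete, the sketch below illustrates one possible stochastic primal-dual update driven by a sampling oracle: the value function is linear in state features, the state-action distribution is bilinear in state and action features, and each sampled transition updates both sets of parameters. The toy MDP, feature matrices, oracle, step sizes, and the specific gradient / exponentiated-gradient update rules are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Minimal illustrative sketch (assumptions, not the paper's exact method):
# primal value function V(s) ~= Phi[s] @ w, dual distribution
# mu(s, a) ~= Phi[s] @ Theta @ Psi[a], one stochastic update per sample.
rng = np.random.default_rng(0)
nS, nA, d1, d2, gamma = 20, 5, 8, 4, 0.9   # hypothetical toy sizes

Phi = rng.standard_normal((nS, d1))   # state features phi(s)
Psi = rng.standard_normal((nA, d2))   # action features psi(a)

w = np.zeros(d1)                      # primal (value) parameters
Theta = np.ones((d1, d2))             # dual (distribution) parameters

def oracle():
    """Placeholder sampling oracle: random (s, a, r, s') transition."""
    s, a = rng.integers(nS), rng.integers(nA)
    return s, a, rng.random(), rng.integers(nS)

alpha, beta = 0.01, 0.01              # primal / dual step sizes
for _ in range(10_000):
    s, a, r, s_next = oracle()
    f_s, f_a, f_sn = Phi[s], Psi[a], Phi[s_next]
    delta = r + gamma * f_sn @ w - f_s @ w     # Bellman residual under the model
    mu_sa = f_s @ Theta @ f_a                  # current dual weight on (s, a)
    # Primal descent on w; dual multiplicative (exponentiated-gradient-style)
    # ascent on Theta -- illustrative update choices.
    w -= alpha * mu_sa * (gamma * f_sn - f_s)
    Theta *= np.exp(np.clip(beta * delta * np.outer(f_s, f_a), -10.0, 10.0))
    Theta /= Theta.sum()                       # keep dual variables normalized
```

Because both models live in the feature spaces, each iteration costs time proportional to the feature dimensions rather than to the number of states and actions, and nothing beyond the current sample needs to be stored.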
