Efficient Planning in Large MDPs with Weak Linear Function Approximation

Large-scale Markov decision processes (MDPs) require planning algorithms whose runtime is independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation under only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of all other states. In particular, we make no assumptions about the representability of policies or of the value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, and its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon.
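The "core" condition above is the structural assumption that keeps the planner's per-query work independent of the total number of states: the feature vectors of a small set of core states must linearly span the feature vector of every other state. Below is a minimal numpy sketch that checks this condition on synthetic data; all variable names, dimensions, and data are illustrative assumptions, not taken from the paper, and nothing here reproduces the planning algorithm itself.

import numpy as np

# Toy illustration (hypothetical data): build a feature matrix in which every
# state's feature vector is a combination of the features of a small core set.
rng = np.random.default_rng(0)
n_states, n_core, d = 50, 8, 8

core_features = rng.normal(size=(n_core, d))             # phi(s) for the core states
weights = rng.dirichlet(np.ones(n_core), size=n_states)  # mixing weights, one row per state
features = weights @ core_features                       # each row lies in the span of core_features

def in_core_span(phi, core, tol=1e-8):
    # True if the feature vector phi lies in the linear span of the rows of core.
    coef, *_ = np.linalg.lstsq(core.T, phi, rcond=None)
    return np.linalg.norm(core.T @ coef - phi) <= tol

# The core-state condition holds for every state in this synthetic example.
assert all(in_core_span(features[s], core_features) for s in range(n_states))

Under this assumption, together with low approximation error for the optimal value function, the paper's planner only ever needs to query the generative oracle at core states, which is what allows the runtime to depend on the number of core states rather than the number of states.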
