Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. From the statistical viewpoint, the question is largely unexplored: the existing literature mainly focuses on conditions that permit sample efficient reinforcement learning, with little understanding of which conditions are actually necessary. This work shows that, from the statistical viewpoint, the situation is far subtler than the more traditional approximation viewpoint suggests: the requirements on a representation that suffice for sample efficient RL are considerably more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation is, in and of itself, insufficient for efficient reinforcement learning unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds imply exponential separations in sample complexity between 1) value-based learning with a perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning, and 4) reinforcement learning and imitation learning.
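To make the value-based representational condition concrete, a common way to formalize "a good-but-not-perfect representation" is approximate linear realizability of the optimal action-value function. The LaTeX sketch below is illustrative only; the notation (feature map $\phi$, dimension $d$, horizon $H$, approximation error $\delta$) is assumed here for exposition rather than quoted from the paper's theorem statements.

% Approximate (value-based) realizability: a feature map
%   \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d
% is a "good representation" if some linear combination of the features
% approximates the optimal action-value function uniformly well:
\[
  \inf_{\theta \in \mathbb{R}^d}\;
  \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}}
  \bigl|\, Q^{\star}(s,a) - \phi(s,a)^{\top}\theta \,\bigr|
  \;\le\; \delta .
\]
% \delta = 0 corresponds to a perfect representation; the lower bounds
% concern the regime where \delta is small but nonzero, showing that
% sample efficiency can be lost once \delta crosses a hard threshold
% that depends on the dimension d and the horizon H.

Analogous conditions, with the optimal value function replaced by the transition model or by the policy class, give the model-based and policy-based versions of the same question.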
