Provably Efficient Representation Learning in Low-rank Markov Decision Processes

The success of deep reinforcement learning (DRL) lies largely in its ability to learn a representation well suited to the underlying exploration and exploitation task. However, existing provably efficient reinforcement learning algorithms with linear function approximation typically assume the feature representation is known and fixed. To understand how representation learning can improve the efficiency of RL, we study representation learning for a class of low-rank Markov Decision Processes (MDPs) whose transition kernel can be written in a bilinear form. We propose a provably efficient algorithm called ReLEX that simultaneously learns the representation and performs exploration. We show that ReLEX always performs no worse than a state-of-the-art algorithm without representation learning, and is strictly better in terms of sample efficiency whenever the class of candidate representations satisfies a mild "coverage" property over the whole state-action space.
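To make the bilinear structure mentioned above concrete, the sketch below states the standard low-rank MDP assumption used in this line of work (e.g., FLAMBE). The symbols phi*, mu*, and the representation class Phi are notation introduced here for illustration and are not necessarily the paper's exact formalization.

% Low-rank MDP assumption (sketch): the transition kernel at step h factorizes
% through an unknown d-dimensional feature map \phi^*_h and measure \mu^*_h.
\begin{align}
  \mathbb{P}_h(s' \mid s, a)
    = \langle \phi^*_h(s,a), \mu^*_h(s') \rangle
    = \sum_{i=1}^{d} \phi^*_{h,i}(s,a)\, \mu^*_{h,i}(s'),
    \qquad \phi^*_h \in \Phi .
\end{align}
% Representation learning setting: the learner is given only the candidate
% class \Phi that contains the true feature map \phi^*_h, not \phi^*_h itself,
% so it must identify a good representation while exploring.

Under this assumption, fixing the true feature map recovers the usual linear MDP setting; the representation-learning question studied here is how much is lost, or gained, when only the class Phi is known.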
