Provable Benefit of Multitask Representation Learning in Reinforcement Learning

While representation learning has become a powerful practical technique for reducing sample complexity in reinforcement learning (RL), theoretical understanding of its advantage remains limited. In this paper, we theoretically characterize the benefit of representation learning under the low-rank Markov decision process (MDP) model. We first study multitask low-rank RL (as upstream training), where all tasks share a common representation, and propose a new multitask reward-free algorithm called REFUEL. REFUEL learns both the transition kernel and a near-optimal policy for each task, and outputs a well-learned representation for downstream tasks. Our result demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold. We then study downstream RL in both the online and offline settings, where the agent is assigned a new task that shares the same representation as the upstream tasks. For both settings, we develop a sample-efficient algorithm and show that it finds a near-optimal policy whose suboptimality gap is bounded by the sum of the estimation error of the representation learned upstream and a term that vanishes as the number of downstream samples grows. Our downstream results for online and offline RL further capture the benefit of employing the representation learned upstream rather than learning the representation of the low-rank model directly. To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks.
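For concreteness, the following is a minimal sketch of the setting in standard low-rank MDP notation; the symbols used here (the shared feature map \(\phi^*\), task-specific factors \(\mu_h^{(t)}\), the upstream representation error \(\epsilon_{\mathrm{up}}\), and the downstream sample count \(N\)) are illustrative placeholders rather than the paper's exact notation or constants. In a low-rank MDP, each transition kernel factorizes through a shared \(d\)-dimensional representation:
\[
  P_h^{(t)}(s' \mid s, a) \;=\; \big\langle \phi^*(s,a),\, \mu_h^{(t)}(s') \big\rangle,
  \qquad \phi^*(s,a) \in \mathbb{R}^d,
\]
where \(\phi^*\) is common to all upstream tasks \(t = 1, \dots, T\) while the factors \(\mu_h^{(t)}\) are task-specific. The downstream guarantee described above then takes the schematic form
\[
  V^{\star} - V^{\hat{\pi}} \;\lesssim\; \epsilon_{\mathrm{up}} \;+\; \widetilde{O}\!\big(1/\sqrt{N}\big),
\]
with \(\epsilon_{\mathrm{up}}\) the estimation error of the representation learned upstream and the second term vanishing as the number of downstream samples \(N\) grows.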
