Offline Reinforcement Learning as One Big Sequence Modeling Problem

Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem. To this end, we explore how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.
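To make the reframing concrete, the sketch below shows one way the core idea could be instantiated in PyTorch: each dimension of the state, action, and reward is discretized into tokens, the per-timestep tokens are concatenated into a single flat sequence, and a small GPT-style decoder models that sequence autoregressively. This is a minimal illustration under assumed choices, not the paper's exact implementation; the `discretize` helper, the `TrajectoryGPT` class, and the bin count of 100 are all hypothetical names and settings.

```python
# Minimal sketch: a trajectory as one flat token sequence modeled autoregressively.
# Assumptions (not from the paper): uniform per-dimension binning, 100 bins,
# and a small GPT-style decoder built from nn.TransformerEncoder with a causal mask.
import torch
import torch.nn as nn

NUM_BINS = 100  # assumed number of discretization bins per dimension


def discretize(x, low, high, num_bins=NUM_BINS):
    """Map continuous values in [low, high] to integer token ids in [0, num_bins)."""
    x = torch.clamp((x - low) / (high - low), 0.0, 1.0 - 1e-6)
    return (x * num_bins).long()


class TrajectoryGPT(nn.Module):
    """Autoregressive Transformer over flattened (state, action, reward) tokens."""

    def __init__(self, vocab_size, d_model=128, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        B, T = tokens.shape
        h = self.tok_emb(tokens) + self.pos_emb[:, :T]
        # Causal mask so each position attends only to earlier tokens.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.blocks(h, mask=mask)
        return self.head(h)  # next-token logits at every position


# Flatten one timestep (s_t, a_t, r_t) into tokens; dimensions here are illustrative.
state = torch.randn(1, 17)
action = torch.randn(1, 6)
reward = torch.randn(1, 1)
step_tokens = torch.cat(
    [
        discretize(state, -5.0, 5.0),
        discretize(action, -1.0, 1.0),
        discretize(reward, -10.0, 10.0),
    ],
    dim=-1,
)  # shape (1, 17 + 6 + 1)

model = TrajectoryGPT(vocab_size=NUM_BINS)
logits = model(step_tokens)  # (1, 24, NUM_BINS); train with cross-entropy on shifted targets
```

At inference time, the same model could in principle be rolled out token by token and combined with a reward-aware search such as beam search to select action sequences, which is how the sequence model subsumes the roles of policy, dynamics model, and planner described above.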
