Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Standard dynamics models for continuous control predict the conditional distribution of the next state and reward given the current state and action with a feedforward network that outputs a multivariate Gaussian with diagonal covariance. This modeling choice assumes that the different dimensions of the next state and reward are conditionally independent given the current state and action, an assumption that may be driven by the fact that fully observable, physics-based simulation environments have deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate the different dimensions of the next state and reward sequentially, each conditioned on the previously generated dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. Finally, we show that autoregressive dynamics models are also useful for offline policy optimization, both by enriching the replay buffer through data augmentation and by improving performance with model-based planning.
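To make the autoregressive factorization concrete, the sketch below shows one way such a model could be implemented in PyTorch: each dimension of the concatenated [next state, reward] target is modeled by a scalar Gaussian conditioned on the current state, the action, and all previously generated target dimensions, so the joint likelihood follows from the chain rule. This is a minimal illustrative sketch, not the paper's exact architecture; the class name, the shared MLP with a one-hot dimension index, and the fixed dimension ordering are assumptions made for clarity.

```python
# Minimal sketch of an autoregressive dynamics model (assumed architecture, not
# the paper's exact one). Each target dimension is a Gaussian conditioned on
# (state, action, previously generated target dimensions, dimension index).
import torch
import torch.nn as nn


class ARDynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.target_dim = state_dim + 1  # next-state dimensions + scalar reward
        # Shared MLP; the dimension being predicted is indicated by a one-hot index.
        in_dim = state_dim + action_dim + self.target_dim + self.target_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # mean and log-std of one scalar dimension
        )

    def _params(self, s, a, prefix, idx):
        onehot = torch.zeros(s.shape[0], self.target_dim, device=s.device)
        onehot[:, idx] = 1.0
        out = self.net(torch.cat([s, a, prefix, onehot], dim=-1))
        mean, log_std = out[:, 0], out[:, 1].clamp(-5.0, 2.0)
        return mean, log_std

    def log_prob(self, s, a, target):
        """Exact log-likelihood of target = [next_state, reward] via the chain rule."""
        prefix = torch.zeros_like(target)
        total = 0.0
        for i in range(self.target_dim):
            mean, log_std = self._params(s, a, prefix, i)
            dist = torch.distributions.Normal(mean, log_std.exp())
            total = total + dist.log_prob(target[:, i])
            prefix = prefix.clone()
            prefix[:, i] = target[:, i]  # teacher forcing on observed dimensions
        return total

    @torch.no_grad()
    def sample(self, s, a):
        """Generate [next_state, reward] one dimension at a time."""
        prefix = torch.zeros(s.shape[0], self.target_dim, device=s.device)
        for i in range(self.target_dim):
            mean, log_std = self._params(s, a, prefix, i)
            prefix[:, i] = torch.distributions.Normal(mean, log_std.exp()).sample()
        return prefix[:, :-1], prefix[:, -1]  # next state, reward
```

Under these assumptions, training maximizes log_prob on dataset transitions, and sample supports model-based rollouts, e.g. for model-based OPE, Dyna-style data augmentation of the replay buffer, or planning.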
