Quantile Filtered Imitation Learning

We introduce quantile filtered imitation learning (QFIL), a novel policy improvement operator designed for offline reinforcement learning. QFIL performs policy improvement by running imitation learning on a filtered version of the offline dataset. The filtering step removes (s, a) pairs whose estimated Q-values fall below a given quantile of the pushforward distribution over values induced by sampling actions from the behavior policy. The definitions of both the pushforward Q distribution and the resulting value-function quantile are key contributions of our method. We prove that QFIL yields a safe policy improvement step under function approximation, and that the choice of quantile provides a natural hyperparameter for trading off the bias and variance of the improvement step. Empirically, a synthetic experiment illustrates how QFIL effectively makes this bias-variance tradeoff, and QFIL performs well on the D4RL benchmark.
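
To make the filtering step concrete, the sketch below shows one way it could look in PyTorch. It assumes a learned critic q_net(s, a), a fitted behavior policy with a sample method, and a quantile level tau; these names, the batching details, and the sampling interface are illustrative assumptions rather than the paper's actual implementation.

```python
import torch

def qfil_filter(states, actions, q_net, behavior_policy, tau=0.7, n_samples=32):
    """Keep (s, a) pairs whose estimated Q-value clears the tau-quantile of the
    pushforward value distribution Q(s, a'), a' ~ behavior policy.
    All module and method names here are illustrative placeholders."""
    with torch.no_grad():
        # Approximate the pushforward distribution at each state by sampling
        # actions from the (learned) behavior policy and scoring them with Q.
        sampled = behavior_policy.sample(states, n_samples)          # [B, n_samples, act_dim]
        s_rep = states.unsqueeze(1).expand(-1, n_samples, -1)        # [B, n_samples, obs_dim]
        q_samples = q_net(s_rep.reshape(-1, states.shape[-1]),
                          sampled.reshape(-1, sampled.shape[-1]))
        q_samples = q_samples.reshape(states.shape[0], n_samples)

        # Per-state empirical tau-quantile of the pushforward Q distribution.
        threshold = torch.quantile(q_samples, tau, dim=1)

        # Retain only dataset actions whose estimated Q-value meets the threshold.
        q_data = q_net(states, actions).squeeze(-1)
        keep = q_data >= threshold

    return states[keep], actions[keep]
```

Imitation learning (behavior cloning) would then be run only on the surviving (s, a) pairs, and sweeping tau moves the operator between staying close to plain behavior cloning (low quantile) and filtering more aggressively on fewer samples (high quantile).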
