论文信息 - Conservative Q-Learning for Offline Reinforcement Learning - 字舞流文

Conservative Q-Learning for Offline Reinforcement Learning

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

S. Levine | G. Tucker | Aviral Kumar | Aurick Zhou

[1] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.

[2] Sanjoy Dasgupta,et al. Off-Policy Temporal Difference Learning with Function Approximation , 2001, ICML.

[3] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[4] Leonid Peshkin,et al. Learning from Scarce Experience , 2002, ICML.

[5] Laurent El Ghaoui,et al. Robustness in Markov Decision Problems with Uncertain Transition Matrices , 2003, NIPS.

[6] Michail G. Lagoudakis,et al. Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[7] Pierre Geurts,et al. Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[8] Garud Iyengar,et al. Robust Dynamic Programming , 2005, Math. Oper. Res..

[9] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[10] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[11] Hado van Hasselt,et al. Double Q-learning , 2010, NIPS.

[12] Csaba Szepesvári,et al. Error Propagation for Approximate Policy and Value Iteration , 2010, NIPS.

[13] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[14] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[15] Shie Mannor,et al. Scaling Up Robust MDPs using Function Approximation , 2014, ICML.

[16] Bruno Scherrer,et al. Approximate Policy Iteration Schemes: A Comparison , 2014, ICML.

[17] Philip S. Thomas,et al. High Confidence Policy Improvement , 2015, ICML.

[18] Marc G. Bellemare,et al. The Arcade Learning Environment: An Evaluation Platform for General Agents , 2012, J. Artif. Intell. Res..

[19] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Benjamin Van Roy,et al. Deep Exploration via Bootstrapped DQN , 2016, NIPS.

[21] Marek Petrik,et al. Safe Policy Improvement by Minimizing Robust Baseline Regret , 2016, NIPS.

[22] Marc G. Bellemare,et al. Increasing the Action Gap: New Operators for Reinforcement Learning , 2015, AAAI.

[23] Martin A. Riedmiller,et al. Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards , 2017, ArXiv.

[24] John C. Duchi,et al. Variance-based Regularization with Convex Objectives , 2016, NIPS.

[25] Pieter Abbeel,et al. Constrained Policy Optimization , 2017, ICML.

[26] Benjamin Van Roy,et al. Why is Posterior Sampling Better than Optimism for Reinforcement Learning? , 2016, ICML.

[27] Sergey Levine,et al. Reinforcement Learning with Deep Energy-Based Policies , 2017, ICML.

[28] Shie Mannor,et al. Consistent On-Line Off-Policy Evaluation , 2017, ICML.

[29] Herke van Hoof,et al. Addressing Function Approximation Error in Actor-Critic Methods , 2018, ICML.

[30] Mark S. Squillante,et al. A General Family of Robust Stochastic Operators for Reinforcement Learning , 2018, ArXiv.

[31] Sergey Levine,et al. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations , 2017, Robotics: Science and Systems.

[32] Tom Schaul,et al. Deep Q-learning From Demonstrations , 2017, AAAI.

[33] Qiang Liu,et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation , 2018, NeurIPS.

[34] Sergey Levine,et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation , 2018, CoRL.

[35] Arthur Jacot,et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper) , 2018, NeurIPS.

[36] Marc G. Bellemare,et al. Distributional Reinforcement Learning with Quantile Regression , 2017, AAAI.

[37] Pieter Abbeel,et al. Towards Characterizing Divergence in Deep Q-Learning , 2019, ArXiv.

[38] Amos J. Storkey,et al. Exploration by Random Network Distillation , 2018, ICLR.

[39] Doina Precup,et al. Off-Policy Deep Reinforcement Learning without Exploration , 2018, ICML.

[40] Natasha Jaques,et al. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog , 2019, ArXiv.

[41] Romain Laroche,et al. Safe Policy Improvement with Baseline Bootstrapping , 2017, ICML.

[42] Sergey Levine,et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction , 2019, NeurIPS.

[43] Bo Dai,et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections , 2019, NeurIPS.

[44] Sergey Levine,et al. Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning , 2019, CoRL.

[45] Yifan Wu,et al. Behavior Regularized Offline Reinforcement Learning , 2019, ArXiv.

[46] Sergey Levine,et al. Diagnosing Bottlenecks in Deep Q-learning Algorithms , 2019, ICML.

[47] Sergey Levine,et al. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning , 2019, ArXiv.

[48] Nan Jiang,et al. Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.

[49] Marc G. Bellemare,et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift , 2019, AAAI.

[50] Romain Laroche,et al. Safe Policy Improvement with Soft Baseline Bootstrapping , 2019, ECML/PKDD.

[51] S. Levine,et al. Accelerating Online Reinforcement Learning with Offline Datasets , 2020, ArXiv.

[52] Martin A. Riedmiller,et al. Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning , 2020, ICLR.

[53] Nan Jiang,et al. $Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison , 2020, 2003.03924.

[54] Bo Dai,et al. GenDICE: Generalized Offline Estimation of Stationary Values , 2020, ICLR.

[55] Rishabh Agarwal,et al. An Optimistic Perspective on Offline Reinforcement Learning , 2019, ICML.

[56] 知秀柴田. 5分で分かる!? 有名論文ナナメ読み：Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .

[57] S. Levine,et al. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems , 2020, ArXiv.

[58] Romain Laroche,et al. Safe Policy Improvement with an Estimated Baseline Policy , 2019, AAMAS.

[59] Justin Fu,et al. D4RL: Datasets for Deep Data-Driven Reinforcement Learning , 2020, ArXiv.

[60] Sergey Levine,et al. DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction , 2020, NeurIPS.

[61] Tengyu Ma,et al. Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling , 2019, ICLR.

[62] Brendan O'Donoghue,et al. Variational Bayesian Reinforcement Learning with Regret Bounds , 2018, NeurIPS.