Model-Augmented Q-learning

In recent years, Q-learning has become indispensable for model-free reinforcement learning (MFRL). However, it suffers from well-known problems such as under- and overestimation bias of the value, which may adversely affect policy learning. To resolve this issue, we propose an MFRL framework that is augmented with components of model-based RL. Specifically, we propose to estimate not only the Q-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for Q-learning, which promotes interaction between the estimators. We show that the proposed scheme, called Model-augmented Q-learning (MQL), obtains a policy-invariant solution identical to the solution obtained by learning with the true reward. Finally, we also provide a trick to prioritize past experiences in the replay buffer by utilizing model-estimation errors. We experimentally validate MQL built upon state-of-the-art off-policy MFRL methods, and show that MQL substantially improves their performance and convergence. The proposed scheme is simple to implement and incurs no additional training cost.
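To make the idea concrete, below is a minimal PyTorch-style sketch of a shared network with Q-value, reward, and transition heads, a Q-learning target that bootstraps with the estimated reward, and per-transition model errors usable as replay priorities. The class and function names (`ModelAugmentedQNetwork`, `mql_losses`), layer sizes, and equal loss weighting are illustrative assumptions and not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModelAugmentedQNetwork(nn.Module):
    """Shared torso over (state, action) with three heads: Q, reward, next state."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden_dim, 1)                    # Q(s, a)
        self.reward_head = nn.Linear(hidden_dim, 1)               # r_hat(s, a)
        self.transition_head = nn.Linear(hidden_dim, state_dim)   # s'_hat(s, a)

    def forward(self, state, action):
        h = self.torso(torch.cat([state, action], dim=-1))
        return self.q_head(h), self.reward_head(h), self.transition_head(h)


def mql_losses(net, target_net, batch, gamma=0.99):
    """Q-learning loss using the *estimated* reward in the bootstrap target,
    plus auxiliary reward/transition losses; `batch` holds tensors of shape
    (B, ...) for state, action, reward, next_state, next_action, done."""
    s, a, r, s2, a2, d = (batch[k] for k in
                          ("state", "action", "reward",
                           "next_state", "next_action", "done"))
    q, r_hat, s2_hat = net(s, a)
    with torch.no_grad():
        q_next, _, _ = target_net(s2, a2)
        # Bootstrap with the model-estimated reward instead of the raw reward.
        target = r_hat.detach() + gamma * (1.0 - d) * q_next
    q_loss = F.mse_loss(q, target)
    reward_loss = F.mse_loss(r_hat, r)        # reward-model error
    transition_loss = F.mse_loss(s2_hat, s2)  # transition-model error
    # Per-transition model error, usable as a replay-buffer priority.
    priority = ((r_hat - r).pow(2)
                + (s2_hat - s2).pow(2).mean(dim=-1, keepdim=True)).detach()
    return q_loss + reward_loss + transition_loss, priority
```

In this sketch the shared torso ties the Q, reward, and transition estimators together, and the detached estimated reward replaces the raw reward in the target, mirroring the interaction between estimators described above.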
