Model-Augmented Q-learning

In recent years, Q-learning has become indispensable for model-free reinforcement learning (MFRL). However, it suffers from well-known problems such as under- and overestimation bias of the value, which may adversely affect policy learning. To resolve this issue, we propose an MFRL framework that is augmented with components of model-based RL. Specifically, we propose to estimate not only the Q-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for Q-learning, which promotes interaction between the estimators. We show that the proposed scheme, called Model-augmented Q-learning (MQL), obtains a policy-invariant solution identical to the solution obtained by learning with the true reward. Finally, we also provide a trick to prioritize past experiences in the replay buffer by utilizing model-estimation errors. We experimentally validate MQL built upon state-of-the-art off-policy MFRL methods, and show that MQL substantially improves their performance and convergence. The proposed scheme is simple to implement and incurs no additional training cost.
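To make the idea concrete, below is a minimal PyTorch-style sketch of a shared network with Q-value, reward, and transition heads, a Q-learning target that bootstraps with the estimated reward, and per-transition model errors usable as replay priorities. The class and function names (`ModelAugmentedQNetwork`, `mql_losses`), layer sizes, and equal loss weighting are illustrative assumptions and not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModelAugmentedQNetwork(nn.Module):
    """Shared torso over (state, action) with three heads: Q, reward, next state."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden_dim, 1)                    # Q(s, a)
        self.reward_head = nn.Linear(hidden_dim, 1)               # r_hat(s, a)
        self.transition_head = nn.Linear(hidden_dim, state_dim)   # s'_hat(s, a)

    def forward(self, state, action):
        h = self.torso(torch.cat([state, action], dim=-1))
        return self.q_head(h), self.reward_head(h), self.transition_head(h)


def mql_losses(net, target_net, batch, gamma=0.99):
    """Q-learning loss using the *estimated* reward in the bootstrap target,
    plus auxiliary reward/transition losses; `batch` holds tensors of shape
    (B, ...) for state, action, reward, next_state, next_action, done."""
    s, a, r, s2, a2, d = (batch[k] for k in
                          ("state", "action", "reward",
                           "next_state", "next_action", "done"))
    q, r_hat, s2_hat = net(s, a)
    with torch.no_grad():
        q_next, _, _ = target_net(s2, a2)
        # Bootstrap with the model-estimated reward instead of the raw reward.
        target = r_hat.detach() + gamma * (1.0 - d) * q_next
    q_loss = F.mse_loss(q, target)
    reward_loss = F.mse_loss(r_hat, r)        # reward-model error
    transition_loss = F.mse_loss(s2_hat, s2)  # transition-model error
    # Per-transition model error, usable as a replay-buffer priority.
    priority = ((r_hat - r).pow(2)
                + (s2_hat - s2).pow(2).mean(dim=-1, keepdim=True)).detach()
    return q_loss + reward_loss + transition_loss, priority
```

In this sketch the shared torso ties the Q, reward, and transition estimators together, and the detached estimated reward replaces the raw reward in the target, mirroring the interaction between estimators described above.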
