Striving for Simplicity in Off-policy Deep Reinforcement Learning

Reflecting on the advances in off-policy deep reinforcement learning (RL) since the development of DQN in 2013, it is important to ask: are the complexities of recent off-policy methods really necessary? To isolate the contributions of different factors in off-policy deep RL and to help design simpler algorithms, this paper investigates a set of related questions: First, can effective policies be learned given only access to logged offline experience? Second, how much of the benefit of recent distributional RL algorithms can be attributed to improved exploration versus improved exploitation? Third, can simpler off-policy RL algorithms outperform distributional RL without learning explicit distributions over returns? We investigate these questions using a batch RL experimental setup on Atari 2600 games. Unexpectedly, we find that batch RL algorithms trained solely on the logged experiences of a DQN agent can significantly outperform online DQN. Our experiments suggest that the benefits of distributional RL stem mainly from better exploitation. We present a simple and novel variant of ensemble Q-learning called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-head Q-network. The batch REM agent trained offline on DQN data outperforms both batch QR-DQN and online C51.
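
To make the REM update concrete, the following is a minimal sketch of the idea of enforcing Bellman consistency on a random convex combination of Q-heads, written in PyTorch-style Python. The function name rem_loss, its arguments, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rem_loss(q_heads, target_q_heads, actions, rewards, dones, gamma=0.99):
    """Hypothetical sketch of a REM-style loss on a batch of logged transitions.

    q_heads:        (batch, num_heads, num_actions) Q-values from the online network
    target_q_heads: (batch, num_heads, num_actions) Q-values from the target network
    actions, rewards, dones: (batch,) tensors drawn from the fixed replay data
    """
    # Draw a random convex combination over the heads (one per batch element).
    alpha = torch.rand_like(q_heads[..., :1])            # (batch, num_heads, 1)
    alpha = alpha / alpha.sum(dim=1, keepdim=True)

    # Mix the heads into a single Q-function for this gradient step.
    q_mix = (alpha * q_heads).sum(dim=1)                  # (batch, num_actions)
    q_sa = q_mix.gather(1, actions.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Apply the same mixture to the target network and bootstrap.
        target_mix = (alpha * target_q_heads).sum(dim=1)
        target = rewards + gamma * (1.0 - dones) * target_mix.max(dim=1).values

    # Huber loss enforces optimal Bellman consistency on the mixed Q-values.
    return F.smooth_l1_loss(q_sa, target)
```

Because a fresh mixture is sampled at every gradient step, the ensemble of heads is trained against a continuum of Q-functions rather than a single fixed one, without modeling an explicit return distribution.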
