Off-Policy Deep Reinforcement Learning without Exploration

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated with the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space to force the agent toward behavior that is close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
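As a concrete illustration of the batch-constrained idea, the sketch below applies it in the simplest possible setting: a discrete, tabular MDP, where "staying close to the batch" reduces to restricting the greedy Bellman backup to actions that actually appear with each state in the dataset. This is a minimal sketch under those assumptions, not the paper's implementation; the function name `batch_constrained_q` and its parameters are illustrative, and the paper's continuous-control algorithm instead trains a generative model over the batch's actions rather than using an explicit lookup of seen actions.

```python
from collections import defaultdict
import numpy as np

def batch_constrained_q(transitions, n_states, n_actions,
                        gamma=0.99, lr=0.1, sweeps=100):
    """Tabular batch-constrained Q-learning on a fixed dataset (illustrative sketch).

    Unlike standard Q-learning, the backup max_a' Q(s', a') is taken only
    over actions observed together with s' in the batch, so the Bellman
    target never queries (state, action) pairs the data cannot support --
    the source of the extrapolation error described in the abstract.
    """
    # Record which actions the batch contains for each state.
    seen = defaultdict(set)
    for s, a, _, _, _ in transitions:
        seen[s].add(a)

    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s2, done in transitions:
            if done or not seen[s2]:
                # No in-batch action at s2: fall back to the immediate reward.
                target = r
            else:
                # Greedy backup restricted to in-batch actions only.
                target = r + gamma * max(Q[s2, a2] for a2 in seen[s2])
            Q[s, a] += lr * (target - Q[s, a])
    return Q

# Toy usage: a hypothetical batch of (s, a, r, s', done) tuples
# gathered by some behavioral policy.
batch = [(0, 1, 0.0, 1, False), (1, 0, 1.0, 2, True), (0, 0, 0.0, 2, True)]
Q = batch_constrained_q(batch, n_states=3, n_actions=2)
```

In the deep, continuous-action setting described in the abstract, the `seen` lookup has no direct analogue; the paper's approach replaces it with candidate actions sampled from a generative model fit to the batch, so that the learned policy only selects actions the data plausibly supports.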
