Learning Dynamics and Generalization in Deep Reinforcement Learning

Solving a reinforcement learning (RL) problem poses two competing challenges: fitting a potentially discontinuous value function, and generalizing well to new observations. In this paper, we analyze the learning dynamics of temporal difference algorithms to gain novel insight into the tension between these two objectives. We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization. We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. Finally, we investigate how post-training policy distillation may avoid this pitfall, and show that this approach improves generalization to novel environments in the ProcGen suite and improves robustness to input perturbations.
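To make the two training signals contrasted above concrete, the following is a minimal NumPy sketch (not the paper's code; the function names, toy random features, and linear parameterizations are illustrative assumptions) of a semi-gradient TD(0) step, where the bootstrapped target depends on the learner's own weights but is held fixed during differentiation, and a post-training policy-distillation step, where a student regresses onto a frozen teacher's action distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 8, 4   # toy dimensions, purely illustrative
gamma, alpha = 0.99, 0.05      # discount factor and step size

# --- Temporal-difference learning: semi-gradient TD(0), linear V(s) = w . phi(s) ---
w = np.zeros(n_features)

def td0_update(w, phi_s, reward, phi_next, done):
    """One semi-gradient TD(0) step; the bootstrapped target is treated as a constant."""
    target = reward + (0.0 if done else gamma * (phi_next @ w))
    td_error = target - phi_s @ w
    return w + alpha * td_error * phi_s   # gradient of 0.5 * td_error^2 w.r.t. w, target fixed

# --- Post-training policy distillation: student matches a frozen teacher's policy ---
theta = np.zeros((n_features, n_actions))   # student logits = phi(s) @ theta

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_update(theta, phi_s, teacher_probs):
    """One cross-entropy step regressing the student's action distribution onto the teacher's."""
    student_probs = softmax(phi_s @ theta)
    grad = np.outer(phi_s, student_probs - teacher_probs)  # d(cross-entropy)/d(theta)
    return theta - alpha * grad

# Toy transition and teacher distribution, for illustration only.
phi_s, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
w = td0_update(w, phi_s, reward=1.0, phi_next=phi_next, done=False)
theta = distill_update(theta, phi_s, teacher_probs=np.array([0.7, 0.1, 0.1, 0.1]))
```

The salient difference is that the TD target moves with the learner's own weights, whereas the distillation step is ordinary supervised regression onto fixed targets, which is the regime the abstract argues yields better generalization.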
