Discount Factor as a Regularizer in Reinforcement Learning

Specifying a Reinforcement Learning (RL) task involves choosing a suitable planning horizon, which is typically modeled by a discount factor. It is known that applying RL algorithms with a lower discount factor can act as a regularizer, improving performance in the limited-data regime. Yet the exact nature of this regularizer has not been investigated. In this work, we fill this gap. For several Temporal-Difference (TD) learning methods, we show an explicit equivalence between using a reduced discount factor and adding an explicit regularization term to the algorithm's loss. Motivated by this equivalence, we empirically compare the technique to standard L2 regularization through extensive experiments in discrete and continuous domains, using both tabular and function-approximation representations. Our experiments suggest that the effectiveness of this regularization is strongly related to properties of the available data, such as its size, distribution, and mixing rate.
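
To make the kind of equivalence described above concrete, the following is a minimal sketch for linear TD(0), written in our own notation (theta, phi_s, phi_next, gamma_tilde are illustrative names, and the exact form of the penalty is an assumption for this sketch rather than the paper's derivation): a single TD(0) update with a reduced discount factor gamma_tilde < gamma coincides with a TD(0) update under the full gamma in which a penalty proportional to (gamma - gamma_tilde) times the next-state value is subtracted from the TD error.

    import numpy as np

    # Illustrative linear TD(0) sketch: a reduced discount factor acts like an
    # explicit penalty on the next-state value estimate (our notation/assumptions).
    rng = np.random.default_rng(0)
    d = 4                              # feature dimension
    theta = rng.normal(size=d)         # shared initial weight vector
    alpha, gamma, gamma_tilde = 0.1, 0.99, 0.9

    phi_s, phi_next, r = rng.normal(size=d), rng.normal(size=d), 1.0

    def v(theta, phi):
        # Linear value estimate V(s) = phi(s)^T theta
        return phi @ theta

    # (a) TD(0) update using the reduced discount factor gamma_tilde
    delta_a = r + gamma_tilde * v(theta, phi_next) - v(theta, phi_s)
    theta_a = theta + alpha * delta_a * phi_s

    # (b) TD(0) update using gamma, with an explicit penalty on the next-state value
    delta_b = r + gamma * v(theta, phi_next) - v(theta, phi_s)
    penalty = (gamma - gamma_tilde) * v(theta, phi_next)   # regularization-like term
    theta_b = theta + alpha * (delta_b - penalty) * phi_s

    assert np.allclose(theta_a, theta_b)   # the two one-step updates are identical

The identity follows by expanding the two TD errors: subtracting (gamma - gamma_tilde) * V(s') from the gamma-discounted error leaves exactly the gamma_tilde-discounted error, which is what the assertion checks numerically.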
