Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability

In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters – the rewards and transitions – is assumed, and a policy that optimizes the (posterior) expected return is sought. A common approximation, recently popularized as meta-RL, is to train the agent on a sample of N problem instances from the prior, with the hope that for large enough N, good generalization to an unseen test instance is obtained. In this work, we study generalization in Bayesian RL under the probably approximately correct (PAC) framework, using the method of algorithmic stability. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense. Most stability results in the literature build on strong convexity of the regularized loss – an approach that is not suitable for RL, as Markov decision processes (MDPs) are not convex. Instead, building on recent fast convergence rates for mirror descent in regularized MDPs, we show that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. This result, which may be of independent interest, allows us to study the effect of regularization on generalization in the Bayesian RL setting.
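To make the quadratic-growth argument concrete, the following is a minimal sketch in standard entropy-regularization notation; the symbols $\lambda$ (regularization weight), $H$ (policy entropy), $J_\lambda$ (regularized expected return), and $c$ (growth constant) are illustrative choices rather than the paper's own notation. With a regularized objective

\[
J_\lambda(\pi) \;=\; \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) + \lambda\, H(\pi(\cdot \mid s_t)) \big)\Big],
\]

quadratic growth around the regularized optimum $\pi^*_\lambda$ states that

\[
J_\lambda(\pi^*_\lambda) - J_\lambda(\pi) \;\ge\; \frac{c}{2}\, \big\|\pi - \pi^*_\lambda\big\|^2 \quad \text{for every policy } \pi,
\]

so any policy whose regularized return is nearly optimal must itself be close to $\pi^*_\lambda$. This is the property that substitutes for strong convexity in the stability argument: replacing one of the N sampled problem instances perturbs the empirical objective only slightly, and quadratic growth converts that small objective perturbation into a small change in the optimal policy, which is the kind of algorithmic stability that yields a PAC generalization bound.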
