High Confidence Generalization for Reinforcement Learning

We present several classes of reinforcement learning algorithms that safely generalize to Markov decision processes (MDPs) not seen during training. Specifically, we study the setting in which a set of MDPs is available for training. For various definitions of safety, our algorithms provide probabilistic guarantees that agents safely generalize to MDPs that are sampled from the same distribution but are not necessarily in the training set. These algorithms are Seldonian algorithms (Thomas et al., 2019): a class of machine learning algorithms that return models with probabilistic safety guarantees with respect to user-specified definitions of safety.
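
To make the flavor of these guarantees concrete, here is a minimal Python sketch in the spirit of a Seldonian safety test (not the algorithms developed in the paper): a candidate policy is evaluated on held-out MDPs sampled from the task distribution, a high-confidence lower bound on its expected return is computed with Hoeffding's inequality or Student's t statistic, and the policy is accepted only if that bound clears a user-specified performance threshold. All function names, variable names, and the synthetic data below are illustrative assumptions.

    # Illustrative sketch only -- not the paper's algorithms. A Seldonian-style
    # safety test: accept a candidate policy only if a (1 - delta)-confidence
    # lower bound on its expected return, estimated from returns on held-out
    # MDPs sampled from the task distribution, exceeds a safety threshold.
    import numpy as np
    from scipy import stats

    def hoeffding_lower_bound(returns, delta, r_min, r_max):
        # Hoeffding bound: assumes i.i.d. returns bounded in [r_min, r_max].
        n = len(returns)
        width = (r_max - r_min) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
        return np.mean(returns) - width

    def ttest_lower_bound(returns, delta):
        # Student's t bound: approximate, assumes returns are roughly normal.
        n = len(returns)
        half_width = stats.t.ppf(1.0 - delta, n - 1) * np.std(returns, ddof=1) / np.sqrt(n)
        return np.mean(returns) - half_width

    def passes_safety_test(returns, threshold, delta=0.05, bound="hoeffding",
                           r_min=0.0, r_max=1.0):
        # Return True only if the high-confidence lower bound clears the
        # threshold; otherwise the candidate is rejected ("no solution found").
        if bound == "hoeffding":
            lower = hoeffding_lower_bound(returns, delta, r_min, r_max)
        else:
            lower = ttest_lower_bound(returns, delta)
        return lower >= threshold

    # Hypothetical usage: returns of a candidate policy on 20 held-out MDPs.
    rng = np.random.default_rng(0)
    held_out_returns = rng.uniform(0.4, 0.9, size=20)
    print(passes_safety_test(held_out_returns, threshold=0.5, delta=0.05))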

[1] Xingyou Song et al. Observational Overfitting in Reinforcement Learning. ICLR, 2019.

[2] Philip S. Thomas et al. Concentration Inequalities for Conditional Value at Risk. ICML, 2019.

[3] Peter Stone et al. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 2009.

[4] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 1963.

[5] Xingyou Song et al. The Principle of Unchanged Optimality in Reinforcement Learning Generalization. arXiv, 2019.

[6] Honglak Lee et al. Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. ICML, 2017.

[7] Philip S. Thomas et al. Preventing undesirable behavior of intelligent machines. Science, 2019.

[8] Taehoon Kim et al. Quantifying Generalization in Reinforcement Learning. ICML, 2018.

[9] Meysam Bastani et al. Model-Free Intelligent Diabetes Management Using Machine Learning. 2014.

[10] David B. Brown et al. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 2007.

[11] Michael L. Littman et al. Measuring and Characterizing Generalization in Deep Reinforcement Learning. Applied AI Letters, 2018.

[12] Ben J. A. Kröse et al. Learning from delayed rewards. Robotics and Autonomous Systems, 1995.

[13] Student. The Probable Error of a Mean. Biometrika, 1908.

[14] Andrew G. Barto et al. Autonomous shaping: knowledge transfer in reinforcement learning. ICML, 2006.

[15] Peter Stone et al. Transfer Learning via Inter-Task Mappings for Temporal Difference Learning. Journal of Machine Learning Research, 2007.

[16] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.

[17] Finale Doshi-Velez et al. Hidden Parameter Markov Decision Processes: A Semiparametric Regression Approach for Discovering Latent Task Parametrizations. IJCAI, 2013.

[18] Kathleen M. Jagodnik et al. Reinforcement Learning and Feedback Control for High-Level Upper-Extremity Neuroprostheses. 2014.

[19] Joelle Pineau et al. A Dissection of Overfitting and Generalization in Continuous Reinforcement Learning. arXiv, 2018.

[20] Robert F. Kirsch et al. Combined feedforward and feedback control of a redundant, nonlinear, dynamic musculoskeletal system. Medical & Biological Engineering & Computing, 2009.

[21] Finale Doshi-Velez et al. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. AAAI, 2017.

[22] George Konidaris et al. Value Function Approximation in Reinforcement Learning Using the Fourier Basis. AAAI, 2011.

[23] Sergey Levine et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML, 2017.

[24] Romain Laroche et al. Safe Policy Improvement with an Estimated Baseline Policy. AAMAS, 2020.

[25] Romain Laroche et al. Safe Policy Improvement with Baseline Bootstrapping. ICML, 2017.

[26] Richard Socher et al. On the Generalization Gap in Reparameterizable Reinforcement Learning. ICML, 2019.

[27] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.