Lipschitz Lifelong Reinforcement Learning

We consider the problem of knowledge transfer when an agent faces a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with an improved convergence rate. We illustrate the benefits of the method in Lifelong RL experiments.
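A minimal sketch of the continuity property claimed above, assuming tasks M and M' share a state-action space S x A, with d a pseudometric between MDPs and L a Lipschitz constant (the symbols below are illustrative notation, not the paper's exact definitions):

    \forall (s,a) \in S \times A : \quad \bigl| Q^*_{M}(s,a) - Q^*_{M'}(s,a) \bigr| \;\le\; L \, d(M, M')

Under such a bound, the optimal values of an already-solved source task, inflated by L d(M, M'), give a valid upper bound on the optimal values of a nearby target task; this is the kind of quantity a value-transfer scheme can exploit to prune exploration in the new task.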
