Sequential Transfer in Reinforcement Learning with a Generative Model

We are interested in how to design reinforcement learning agents that provably reduce the sample complexity of learning new tasks by transferring knowledge from previously solved ones. The availability of solutions to related problems poses a fundamental trade-off: whether to seek policies that are expected to achieve high (yet sub-optimal) performance in the new task immediately, or whether to seek information to quickly identify an optimal solution, potentially at the cost of poor initial behavior. In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. First, given a set of solved tasks containing an approximation of the target one, we design an algorithm that quickly identifies an accurate solution by seeking the state-action pairs that are most informative for this purpose. We derive PAC bounds on its sample complexity, which clearly demonstrate the benefits of using this kind of prior knowledge. Then, we show how to learn these approximate tasks sequentially by reducing our transfer setting to a hidden Markov model and employing spectral methods to recover its parameters. Finally, we empirically verify our theoretical findings in simple simulated domains.
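
To make the identification step concrete, the following is a minimal Python sketch of the general idea described above: given a finite set of candidate MDP models (one of which approximates the target task) and a generative model that can sample a next state for any (state, action) pair, repeatedly query the pair on which the surviving candidates disagree most, and eliminate candidates that are inconsistent with the observed samples. The function names, the total-variation disagreement criterion, and the elimination threshold are illustrative assumptions, not the paper's actual algorithm or bounds.

```python
import numpy as np

def most_informative_pair(models, alive):
    """Pick the (s, a) with the largest pairwise total-variation disagreement
    among the surviving candidate transition models."""
    n_states, n_actions, _ = models[0].shape
    best, best_score = (0, 0), -1.0
    for s in range(n_states):
        for a in range(n_actions):
            dists = [models[i][s, a] for i in alive]
            # max pairwise TV distance as a simple informativeness proxy
            score = max(0.5 * np.abs(p - q).sum() for p in dists for q in dists)
            if score > best_score:
                best, best_score = (s, a), score
    return best

def identify_task(models, generative_model, n_samples=200, tol=0.1, max_rounds=50):
    """Eliminate candidate models (arrays of shape [S, A, S]) until at most one
    survives, querying the generative model only at informative pairs."""
    alive = set(range(len(models)))
    for _ in range(max_rounds):
        if len(alive) <= 1:
            break
        s, a = most_informative_pair(models, alive)
        # sample next states for (s, a) from the true (unknown) task
        samples = [generative_model(s, a) for _ in range(n_samples)]
        emp = np.bincount(samples, minlength=models[0].shape[2]) / n_samples
        # discard models whose predicted distribution is far from the empirical one
        alive = {i for i in alive
                 if 0.5 * np.abs(models[i][s, a] - emp).sum() <= tol}
    return alive
```

Under these assumptions, the number of generative-model queries needed depends on how quickly the most-disagreeing pairs shrink the candidate set, which is the kind of quantity a PAC sample-complexity bound for this setting would control.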
