Importance Resampling for Off-policy Prediction

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is unbiased and consistent, it can produce high-variance updates to the value function weights. In this work, we explore a resampling strategy as an alternative to reweighting. We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower-variance updates compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace, which clips IS ratios. We also demonstrate that IR improves over IS when learning a value function from images in a racing-car simulator.
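
The contrast between the two update schemes can be made concrete with a small sketch. The following is a minimal illustration for tabular TD(0) prediction, not the paper's implementation: the function names, buffer layout, and hyperparameters are assumptions chosen for readability. IR draws transitions from the replay buffer with probability proportional to their importance sampling ratios and then applies an ordinary on-policy TD update, whereas the IS baseline samples uniformly and multiplies each update by the ratio.

```python
import numpy as np

def importance_resampling_td0(buffer, v, alpha=0.1, gamma=0.99,
                              num_updates=1000, rng=None):
    """Sketch of Importance Resampling (IR) for off-policy TD(0) prediction.

    buffer: list of transitions (s, r, s_next, rho), where
            rho = pi(a|s) / mu(a|s) is the IS ratio of the logged action
            under the target policy pi and behaviour policy mu.
    v:      tabular value estimates, an np.ndarray indexed by state.
    """
    rng = rng or np.random.default_rng(0)
    rhos = np.array([t[3] for t in buffer])
    # Correct the distribution before the update: resample transitions
    # with probability proportional to their IS ratios.
    probs = rhos / rhos.sum()
    for _ in range(num_updates):
        i = rng.choice(len(buffer), p=probs)
        s, r, s_next, _ = buffer[i]
        # Standard on-policy TD(0) update; no ratio appears in the update.
        td_error = r + gamma * v[s_next] - v[s]
        v[s] += alpha * td_error
    return v

def importance_sampling_td0(buffer, v, alpha=0.1, gamma=0.99,
                            num_updates=1000, rng=None):
    """Baseline: off-policy TD(0) with IS, sampling uniformly from the
    buffer and weighting each update by the ratio rho."""
    rng = rng or np.random.default_rng(0)
    for _ in range(num_updates):
        s, r, s_next, rho = buffer[rng.integers(len(buffer))]
        td_error = r + gamma * v[s_next] - v[s]
        v[s] += alpha * rho * td_error
    return v
```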