A Study of Off-policy Learning in Computational Sustainability

Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behavior policy. Although several methods exist for off-policy problems, the literature offers little guidance on which of them perform best. In this paper, we conduct an in-depth comparative study of off-policy evaluation methods in non-bandit, finite-horizon MDPs, using a well-known mallard population dynamics model (Anderson, 1975). We find that unnormalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte Carlo-style methods.
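To illustrate why the variance of unnormalized importance sampling grows with the look-ahead horizon, the following minimal sketch uses a toy problem that is not the mallard model studied here. Everything in it is an assumption made for illustration: a two-action, stateless finite-horizon problem, a behavior policy that picks action 0 with probability 0.5, a target policy that picks it with probability 0.9, and a reward of 1 per step for action 0. The names `is_estimate` and `weight_variance` are hypothetical.

```python
import random


def is_estimate(horizon, n_traj, p_target=0.9, p_behavior=0.5, seed=0):
    """Unnormalized trajectory-wise importance sampling estimate of the
    target policy's expected return, using data drawn from the behavior
    policy (toy problem; all parameters are illustrative assumptions)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        weight, ret = 1.0, 0.0
        for _ in range(horizon):
            # The behavior policy chooses action 0 with prob. p_behavior.
            if rng.random() < p_behavior:
                weight *= p_target / p_behavior
                ret += 1.0  # reward 1 for action 0, 0 otherwise
            else:
                weight *= (1.0 - p_target) / (1.0 - p_behavior)
        total += weight * ret
    # Unnormalized: divide by the number of trajectories,
    # not by the sum of the weights.
    return total / n_traj


def weight_variance(horizon, p_target=0.9, p_behavior=0.5):
    """Exact variance of the trajectory importance weight under the
    behavior policy. Per-step weights are independent with mean 1, so the
    second moment of their product is the per-step second moment raised
    to the horizon."""
    second_moment = (p_target ** 2 / p_behavior
                     + (1.0 - p_target) ** 2 / (1.0 - p_behavior))
    return second_moment ** horizon - 1.0


if __name__ == "__main__":
    for h in (1, 5, 10, 20):
        print(h, weight_variance(h), is_estimate(h, 10_000))
```

In this toy setting the weight variance grows geometrically in the horizon, which is consistent with the qualitative finding stated above; it says nothing about the magnitudes observed in the paper's experiments.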

[1] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML '09.

[2] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[3] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[4] Christopher J. Fonnesbeck, et al. Solving Dynamic Wildlife Resource Optimization Problems Using Reinforcement Learning, 2005.

[5] David R. Anderson. Optimal Exploitation Strategies for an Animal Population in a Markovian Environment: A Theory and an Example, 1975.

[6] David B. Dunson, et al. Approximate Dynamic Programming for Storage Problems, 2011, ICML.

[7] John N. Tsitsiklis, et al. Bias and Variance Approximation in Value Function Estimates, 2007, Manag. Sci.

[8] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[9] Joelle Pineau, et al. Treating Epilepsy via Adaptive Neurostimulation: a Reinforcement Learning Approach, 2009, Int. J. Neural Syst.

[10] E. Ziegel. Modern Mathematical Statistics, 1989.

[11] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[12] Naoki Abe, et al. Optimizing debt collections using constrained reinforcement learning, 2010, KDD.

[13] S. Murphy, et al. An experimental design for the development of adaptive treatment strategies, 2005, Statistics in medicine.

[14] Reuven Y. Rubinstein, et al. Simulation and the Monte Carlo method, 1981, Wiley series in probability and mathematical statistics.

[15] A. Tsiatis. Semiparametric Theory and Missing Data, 2006.

[16] Joseph Kang, et al. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data, 2007, arXiv:0804.2958.

[17] Justin A. Boyan, et al. Technical Update: Least-Squares Temporal Difference Learning, 2002, Machine Learning.

[18] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.

[19] J. Robins, et al. Marginal Structural Models and Causal Inference in Epidemiology, 2000, Epidemiology.