A Study of Off-policy Learning in Computational Sustainability

Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behavior policy. Although several methods address off-policy problems, the existing literature offers little guidance on which of them perform best. In this paper, we conduct an in-depth comparative study of off-policy evaluation methods in non-bandit, finite-horizon Markov decision processes (MDPs), using a well-known model of Mallard population dynamics (Anderson, 1975). We find that un-normalized importance sampling can exhibit prohibitively large variance in problems whose look-ahead exceeds a few time steps, and that dynamic programming methods perform better than Monte-Carlo-style methods.
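The variance finding for un-normalized importance sampling can be illustrated with a minimal sketch. It assumes a hypothetical stateless toy problem, not the Mallard model used in the paper: at each of H steps the agent picks action 0 or 1 and receives a reward equal to the action. The names `P_BEHAVIOR`, `P_TARGET`, `sample_episode`, `estimate` and `compare` are illustrative only. Weighted (self-normalized) importance sampling is included purely as a point of comparison and is not claimed to be one of the specific estimators the paper evaluates.

```python
import random
import statistics

# Hypothetical toy setup (NOT the paper's Mallard model): at each of H steps the
# agent picks action 0 or 1 and receives a reward equal to the chosen action.
P_BEHAVIOR = 0.5  # behavior policy b: probability of choosing action 1
P_TARGET = 0.9    # target policy pi: probability of choosing action 1


def prob(p_one, action):
    """Probability that a policy with P(action = 1) = p_one assigns to `action`."""
    return p_one if action == 1 else 1.0 - p_one


def sample_episode(horizon, rng):
    """Roll out one episode under the behavior policy.

    Returns the undiscounted return G and the per-trajectory importance
    ratio rho = prod_t pi(a_t) / b(a_t).
    """
    ret, rho = 0.0, 1.0
    for _ in range(horizon):
        action = 1 if rng.random() < P_BEHAVIOR else 0
        ret += action
        rho *= prob(P_TARGET, action) / prob(P_BEHAVIOR, action)
    return ret, rho


def estimate(horizon, n_episodes, rng):
    """Return (ordinary IS, weighted IS) estimates of the target policy's value."""
    episodes = [sample_episode(horizon, rng) for _ in range(n_episodes)]
    weighted_returns = [rho * ret for ret, rho in episodes]
    total_rho = sum(rho for _, rho in episodes)  # always > 0: every ratio is positive
    ordinary = sum(weighted_returns) / n_episodes  # un-normalized importance sampling
    weighted = sum(weighted_returns) / total_rho   # self-normalized (weighted) IS
    return ordinary, weighted


def compare(horizon, n_episodes=200, n_trials=200, seed=0):
    """Repeat the estimate n_trials times and return each estimator's std. dev."""
    rng = random.Random(seed)
    results = [estimate(horizon, n_episodes, rng) for _ in range(n_trials)]
    ordinary, weighted = zip(*results)
    return statistics.stdev(ordinary), statistics.stdev(weighted)


if __name__ == "__main__":
    for horizon in (1, 5, 10, 20):
        sd_ord, sd_w = compare(horizon)
        print(f"H={horizon:2d}  true value={P_TARGET * horizon:5.1f}  "
              f"sd(ordinary IS)={sd_ord:8.2f}  sd(weighted IS)={sd_w:6.2f}")
```

Because the importance ratio is a product of H per-step ratios, its second moment under the behavior policy is (Σ_a π(a)²/b(a))^H, which equals 1.64^H for the probabilities above, so the spread of the un-normalized estimate grows exponentially with the horizon. The weighted estimate is a convex combination of observed returns and therefore stays within [0, H], at the cost of some bias.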

Joelle Pineau | Doina Precup

[1] Shalabh Bhatnagar, et al. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation, 2009, ICML.

[2] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[3] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[4] Christopher J. Fonnesbeck, et al. Solving Dynamic Wildlife Resource Optimization Problems Using Reinforcement Learning, 2005.

[5] David R. Anderson. Optimal Exploitation Strategies for an Animal Population in a Markovian Environment: A Theory and an Example, 1975.

[6] David B. Dunson, et al. Approximate Dynamic Programming for Storage Problems, 2011, ICML.

[7] John N. Tsitsiklis, et al. Bias and Variance Approximation in Value Function Estimates, 2007, Manag. Sci.

[8] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[9] Joelle Pineau, et al. Treating Epilepsy via Adaptive Neurostimulation: A Reinforcement Learning Approach, 2009, Int. J. Neural Syst.

[10] E. Ziegel. Modern Mathematical Statistics, 1989.

[11] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.