Model-Free and Model-Based Policy Evaluation when Causality is Uncertain

When decision-makers can directly intervene, policy evaluation algorithms give valid causal estimates. In off-policy evaluation (OPE), however, there may exist unobserved variables that both affect the dynamics and are used by the unknown behavior policy. These “confounders” introduce spurious correlations, and naive estimates of a new policy's value will be biased. We develop worst-case bounds to assess sensitivity to such unobserved confounders in finite-horizon problems when confounders are drawn i.i.d. each period. We demonstrate that a model-based approach using robust MDPs gives sharper lower bounds by exploiting domain knowledge about the dynamics. Finally, we show that when unobserved confounders are persistent over time, OPE is far more difficult and existing techniques produce extremely conservative bounds.
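
The two approaches above can be illustrated with minimal sketches. The sensitivity parameter gamma (a bound on how much the true, confounded behavior probabilities may deviate multiplicatively from the estimated ones) and the interval uncertainty sets on the transition matrix are illustrative assumptions, not the paper's exact formulation. The first function computes a worst-case lower bound on a self-normalized importance-sampling estimate when each nominal weight may be perturbed by a factor of at most gamma; the other two perform a worst-case finite-horizon backup over interval-bounded transitions in the spirit of robust MDPs.

```python
import numpy as np

def worst_case_is_lower_bound(returns, nominal_weights, gamma):
    """Lower bound on a self-normalized importance-sampling estimate when each
    nominal weight w_i may be replaced by any value in [w_i / gamma, w_i * gamma]
    (one common sensitivity parameterization; an assumption for illustration).
    The minimizing weights inflate the lowest returns and deflate the highest,
    so it suffices to scan a single threshold over the sorted returns."""
    order = np.argsort(returns)
    r = np.asarray(returns, dtype=float)[order]
    w = np.asarray(nominal_weights, dtype=float)[order]
    lo, hi = w / gamma, w * gamma
    best = np.inf
    for k in range(len(r) + 1):
        lam = np.concatenate([hi[:k], lo[k:]])   # up-weight low returns, down-weight high
        best = min(best, lam @ r / lam.sum())
    return best

def worst_case_transition(p_lo, p_hi, next_values):
    """Pick a transition distribution inside the box [p_lo, p_hi] (summing to 1)
    that minimizes the expected next-state value: fill low-value states first.
    Assumes the bounds are consistent, i.e. sum(p_lo) <= 1 <= sum(p_hi)."""
    p = p_lo.copy()
    remaining = 1.0 - p.sum()
    for s in np.argsort(next_values):            # lowest-value states first
        add = min(p_hi[s] - p_lo[s], remaining)
        p[s] += add
        remaining -= add
    return p

def robust_policy_value(P_lo, P_hi, R, pi_eval, horizon):
    """Finite-horizon robust policy evaluation: at every backup, nature picks the
    worst transition distribution from the interval uncertainty set.
    P_lo, P_hi: [S, A, S] elementwise bounds; R: [S, A] rewards; pi_eval: [S, A]."""
    S, A, _ = P_lo.shape
    V = np.zeros(S)
    for _ in range(horizon):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                p = worst_case_transition(P_lo[s, a], P_hi[s, a], V)
                Q[s, a] = R[s, a] + p @ V
        V = (pi_eval * Q).sum(axis=1)            # expectation under the evaluation policy
    return V
```

As sanity checks, setting gamma = 1 recovers the ordinary self-normalized importance-sampling estimate, and setting P_lo = P_hi reduces the robust backup to standard finite-horizon evaluation of pi_eval.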
