Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning

Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare. In such settings, however, unobserved variables confound the observed actions, rendering the value of a new policy unidentifiable, so exact evaluation is impossible. We develop a robust approach that estimates sharp bounds on the (unidentifiable) value of a given policy in an infinite-horizon problem, given data from another policy with unobserved confounding, subject to a sensitivity model. We consider stationary or baseline unobserved confounding and compute bounds by optimizing over the set of all stationary state-occupancy ratios that agree with a new partially identified estimating equation and the sensitivity model. We prove convergence to the sharp bounds as more confounded data are collected. Although checking set membership is a linear program, evaluating the support function requires solving a difficult nonconvex optimization problem. We develop approximations based on nonconvex projected gradient descent and demonstrate the resulting bounds empirically.
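To make the optimization concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the kind of bound computation the abstract describes: per-sample importance weights are box-constrained by a sensitivity parameter Lambda around their nominal values, and projected gradient ascent/descent on the weighted average reward yields upper/lower bounds on the policy value. It simplifies heavily: it computes a single-step, baseline-confounding-style bound on synthetic data, and the projection alternates clipping with a mean-one renormalization rather than projecting onto the paper's polyhedral constraint set with its stationarity estimating equation. All names (Lambda, project, pgd) are hypothetical.

```python
# Illustrative sketch: sensitivity-model bounds on a policy value via
# projected gradient ascent/descent over box-constrained importance weights.
# This is a simplified stand-in, not the paper's full infinite-horizon method.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic confounded logged data: states, actions, rewards.
n_states, n_actions, n = 5, 2, 2000
S = rng.integers(n_states, size=n)
A = rng.integers(n_actions, size=n)
R = rng.normal(size=n)

Lambda = 2.0  # sensitivity parameter bounding deviation from nominal weights

# Nominal behavior policy (uniform) and an evaluation policy to be assessed.
pi_b = np.full((n_states, n_actions), 1.0 / n_actions)
pi_e = np.column_stack([np.full(n_states, 0.7), np.full(n_states, 0.3)])

# Nominal importance weights and the box implied by the sensitivity model.
w_nom = pi_e[S, A] / pi_b[S, A]
w_lo, w_hi = w_nom / Lambda, w_nom * Lambda

def project(w):
    """Heuristic projection: clip to the sensitivity box, then renormalize to
    mean one (a crude stand-in for the paper's estimating-equation constraint)."""
    w = np.clip(w, w_lo, w_hi)
    return w / w.mean()

def value(w):
    """Weighted empirical estimate of the evaluation policy's value."""
    return float(np.mean(w * R))

def pgd(sign, iters=500, step=0.05):
    """Projected gradient ascent (sign=+1) or descent (sign=-1) on mean(w * R);
    the gradient in w is R / n, with the 1/n factor folded into the step size."""
    w = project(w_nom.copy())
    for _ in range(iters):
        w = project(w + sign * step * R)
    return value(w)

upper, lower = pgd(+1.0), pgd(-1.0)
print(f"sensitivity-model bounds on the policy value: [{lower:.3f}, {upper:.3f}]")
```

With Lambda = 1 the box collapses to the nominal weights and both bounds coincide with the standard importance-sampling estimate; larger Lambda widens the interval, mirroring how the sensitivity model trades identification strength for robustness to unobserved confounding.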
