High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Many sequential decision-making systems leverage data collected using prior policies to propose a new policy. For critical applications, it is important that high-confidence guarantees on the new policy's behavior are provided before deployment, to ensure that the policy will behave as desired. Prior works have studied high-confidence off-policy estimation of the expected return; however, high-confidence off-policy estimation of the variance of returns can be equally critical for high-risk applications. In this paper we tackle the previously open problem of estimating and bounding, with high confidence, the variance of returns from off-policy data.

Introduction

Reinforcement learning (RL) has emerged as a promising method for solving sequential decision-making problems (Sutton and Barto 2018). Deploying RL to real-world applications, however, requires additional consideration of reliability, which has been relatively understudied. Specifically, it is often desirable to provide high-confidence guarantees on the behavior of a given policy, before deployment, to ensure that the policy will behave as desired. Prior works in RL have studied the problem of providing high-confidence guarantees on the expected return of an evaluation policy, π, using only data collected from a currently deployed policy called the behavior policy, β (Thomas, Theocharous, and Ghavamzadeh 2015; Hanna, Stone, and Niekum 2017; Kuzborskij et al. 2020). Analogously, researchers have studied the problem of counterfactually estimating and bounding the average treatment effect, with high confidence, using data from past treatments (Bottou et al. 2013).

While these methods present important contributions towards developing practical algorithms, real-world problems may require additional consideration of the variance of returns (effect) under any new policy (treatment) before it can be deployed responsibly. For applications with high stakes in terms of financial cost or public well-being, providing guarantees on the mean outcome alone might not be sufficient. Analysis of variance (ANOVA) has therefore become a de facto standard for many industrial and medical applications (Tabachnick and Fidell 2007). Similarly, analysis of variance can inform numerous real-world applications of RL: for example, (a) analysing the variance of outcomes in a robotics application (Kuindersma, Grupen, and Barto 2013), (b) ensuring that the variance of outcomes for a medical treatment is not high, (c) characterizing the variance of customer experiences for a recommendation system (Teevan et al. 2009), or (d) limiting the variability of the performance of an autonomous driving system (Montgomery 2007).

[Figure 1: Illustrative example of the distributions of returns from a behavior policy β and an evaluation policy π, along with the importance-weighted returns ρ, discussed later. Given trajectories from the behavior policy β, we aim to estimate and bound the variance, σ(π), of returns under an evaluation policy π, with high confidence. Note that the distribution of importance-weighted returns ρ has the mean value μ(π), but might have variance not equal to σ(π).]
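To make the caption's final point concrete, consider a small one-step example; the numbers are chosen purely for illustration and are not from the paper. Here ρ is the single-step ratio π(a)/β(a), and the quick check is written in Python:

    # Hypothetical one-step example: pi picks a1 w.p. 0.8 and a2 w.p. 0.2,
    # beta picks each action w.p. 0.5, and the return G is just the immediate reward.
    p_pi   = {"a1": 0.8, "a2": 0.2}
    p_beta = {"a1": 0.5, "a2": 0.5}
    reward = {"a1": 1.0, "a2": 0.0}

    mu_pi  = sum(p_pi[a] * reward[a] for a in reward)                 # 0.8
    var_pi = sum(p_pi[a] * (reward[a] - mu_pi) ** 2 for a in reward)  # 0.16

    rho      = {a: p_pi[a] / p_beta[a] for a in reward}               # importance ratios pi(a)/beta(a)
    mu_rhoG  = sum(p_beta[a] * rho[a] * reward[a] for a in reward)    # 0.8 -- matches mu(pi)
    var_rhoG = sum(p_beta[a] * (rho[a] * reward[a] - mu_rhoG) ** 2
                   for a in reward)                                   # 0.64 -- not sigma(pi) = 0.16

The importance-weighted return has the correct mean (0.8), but its variance under β (0.64) is not the variance of returns under π (0.16); this is precisely the gap the paper sets out to close.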
More generally, variance estimation can be used to account for risk in decision-making by designing objectives that maximize the mean of returns while minimizing the variance of returns (Sato, Kimura, and Kobayashi 2001; Di Castro, Tamar, and Mannor 2012; La and Ghavamzadeh 2013). Variance estimates have also been shown to be useful for automatically adapting hyper-parameters, like the exploration rate (Sakaguchi and Takano 2004) or the λ for eligibility traces (White and White 2016), and might also inform other methods that depend on the entire distribution of returns (Bellemare, Dabney, and Munos 2017; Dabney et al. 2017).

Despite the wide applicability of variance analysis, estimating and bounding the variance of returns with high confidence, using only off-policy data, has remained an understudied problem. In this paper, we first formalize the problem statement, an illustration of which is provided in Figure 1. We show that the typical use of importance sampling (IS) in RL only corrects for the mean, and so it does not directly provide unbiased off-policy estimates of the variance. We then present an off-policy estimator of the variance of returns that uses IS twice, together with a simple double-sampling technique. To reduce the variance of this estimator, we extend the per-decision IS technique (Precup 2000) to off-policy variance estimation. Building upon this estimator, we provide confidence intervals for the variance using (a) concentration inequalities and (b) statistical bootstrapping.

Advantages: The proposed variance estimator has several advantages: (a) it is model-free and can thus be used irrespective of the environment's complexity, (b) it requires only off-policy data and can therefore be used before actual policy deployment, and (c) it is unbiased and consistent. For high-confidence guarantees, (d) we provide both upper and lower confidence intervals for the variance that have guaranteed coverage (that is, they hold at any desired confidence level and without requiring false assumptions), and (e) we also provide bootstrap confidence intervals, which are approximate but often more practical.

Limitations: The proposed off-policy estimator of the variance relies upon IS and thus inherits its limitations. Namely, (a) it requires knowledge of the action probabilities under the behavior policy β, (b) it requires that the support of trajectories under the evaluation policy π is a subset of the support under the behavior policy β, and (c) the variance of the estimator scales exponentially with the length of the trajectory (Guo, Thomas, and Brunskill 2017; Liu et al. 2018).

Background and Problem Statement

A Markov decision process (MDP) is a tuple (S, A, P, R, γ, d0), where S is the set of states, A is the set of actions, P is the transition function, R is the reward function, γ ∈ [0, 1) is the discount factor, and d0 is the starting-state distribution. A policy π is a distribution over actions conditioned on the state, i.e., π(a|s) represents the probability of taking action a in state s. We assume that the MDP has a finite horizon T, after which any action leads to an absorbing state S(∞). In general, we use subscripts with parentheses for the timestep and subscripts without parentheses for the episode number. Let R_i(j) ∈ [R_min, R_max] represent the reward observed at timestep j of episode i, and let the random variable G_i := ∑_{j=0}^{T} γ^j R_i(j) be the return for episode i.
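As a concrete reference for this return notation, a minimal sketch; the helper name and the example numbers are ours, purely for illustration:

    def discounted_return(rewards, gamma):
        # G_i = sum over j of gamma**j * R_i(j), for one episode's rewards R_i(0), ..., R_i(T).
        return sum((gamma ** j) * r for j, r in enumerate(rewards))

    # e.g., discounted_return([1.0, 0.0, 2.0], gamma=0.9) is 1.0 + 0.0 + 0.9**2 * 2.0 ≈ 2.62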
Let c := (1 − γ^{T+1})/(1 − γ), so that the minimum and maximum possible returns are G_min := c R_min and G_max := c R_max, respectively. Let μ(π) := E_π[G] be the expected return and σ(π) := V_π[G] be the variance of returns, where the subscript π denotes that the trajectories are generated using policy π. We formulate the problem in terms of MDPs, but it can analogously be formulated in terms of structural causal models (Pearl 2009). For simplicity, we consider finite states and actions, but our results extend to POMDPs (by replacing states with observations), to continuous states and actions (by appropriately replacing summations with integrals), and to infinite horizons (T := ∞).

Let H^π_(i):(j) be the set of all possible trajectories for a policy π from timestep i to timestep j. Let H denote a complete trajectory, (S(0), A(0), Pr(A(0)|S(0)), R(0), S(1), ..., S(∞)), where T is the horizon length and S(0) is sampled from d0. Let D be a set of n trajectories {H_i}_{i=1}^{n} generated using behavior policies {β_i}_{i=1}^{n}, respectively. Let

    ρ_i(0, T) := ∏_{j=0}^{T} π(A_i(j)|S_i(j)) / β_i(A_i(j)|S_i(j))

denote the product of importance ratios from timestep 0 to T. For brevity, when the range of timesteps is not needed, we write ρ_i := ρ_i(0, T); similarly, when referring to ρ_i for an arbitrary i ∈ {1, ..., n}, we often write ρ. With this notation, we now formalize the off-policy variance estimation (OVE) and the high-confidence off-policy variance estimation (HCOVE) problems.

OVE Problem: Given a set of trajectories D and an evaluation policy π, we aim to find an estimator σ̂_n that is both an unbiased and consistent estimator of σ(π), i.e., E[σ̂_n] = σ(π) and σ̂_n → σ(π) almost surely.

HCOVE Problem: Given a set of trajectories D, an evaluation policy π, and a confidence level 1 − δ, we aim to find a confidence interval C := [v_lb, v_ub] such that Pr(σ(π) ∈ C) ≥ 1 − δ.

Remark 1. It is worth emphasizing that the OVE problem is about estimating the variance of returns, and not the variance of the estimator of the mean of returns.

These problems would not be possible to solve if the trajectories in D were not informative about the trajectories that are possible under π. For example, if D has no trajectory that could be observed if policy π were executed, then D provides little or no information about the possible outcomes under π. To avoid this case, we make the following common assumption (Precup 2000), which is satisfied if (β_i(a|s) = 0) =⇒ (π(a|s) = 0) for all s ∈ S, a ∈ A, and i ∈ {1, ..., n}.

Assumption 1. The set D contains independent trajectories generated using behavior policies {β_i}_{i=1}^{n}, such that ∀i, H^π_(0):(T) ⊆ H^{β_i}_(0):(T).

The methods that we derive, and IS methods in general, do not require complete knowledge of {β_i}_{i=1}^{n} (which might be parameterized using deep neural networks and might be hard to store); only the probabilities β_i(a|s) for the states s and actions a present in D are required. For simplicity, we restrict our notation to a single behavior policy β, such that ∀i, β_i = β.

Naïve Methods

In the on-policy setting, computing an estimate of μ(π) or σ(π) is trivial: sample n trajectories using π and compute the sample mean or variance of the observed returns {G_i}_{i=1}^{n}.
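A minimal sketch of these quantities, under an assumed (hypothetical) data layout in which each trajectory is stored as a list of (s, a, beta_prob, r) tuples carrying the logged behavior probability, and pi(a, s) is a callable returning π(a|s); none of these names come from the paper:

    import numpy as np

    def importance_ratio(trajectory, pi):
        # rho_i(0, T): product over timesteps of pi(A_i(j) | S_i(j)) / beta_i(A_i(j) | S_i(j)),
        # using the behavior probability logged with each step (only these logged
        # probabilities are needed, as noted above).
        rho = 1.0
        for s, a, beta_prob, _ in trajectory:
            rho *= pi(a, s) / beta_prob
        return rho

    def on_policy_mean_and_variance(returns):
        # Naive on-policy estimates: sample mean and sample variance of the returns
        # {G_i} observed by actually running the evaluation policy pi itself.
        G = np.asarray(returns, dtype=float)
        return G.mean(), G.var(ddof=1)

As the one-step example after Figure 1 suggests, simply substituting ρ_i G_i for G_i in the sample-variance formula would not recover σ(π); correcting for this is what the off-policy variance estimator developed in the paper is designed to do.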

References

[1] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[2] Stephen G. Donald et al. Estimation and inference for distribution functions and quantile functions in treatment effect models, 2014.
[3] D. Horvitz et al. A Generalization of Sampling Without Replacement from a Finite Universe, 1952.
[4] Cyrus Derman et al. SOME CONTRIBUTIONS TO THE THEORY OF, 2016.
[5] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.
[6] Mohammad Ghavamzadeh et al. Actor-Critic Algorithms for Risk-Sensitive MDPs, 2013, NIPS.
[7] Shie Mannor et al. Variance Adjusted Actor Critic Algorithms, 2013, ArXiv.
[8] Yutaka Sakaguchi et al. Reliability of internal prediction/estimation and its application. I. Adaptive action selection reflecting reliability of value function, 2004, Neural Networks.
[9] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.
[10] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.
[11] Scott Kuindersma et al. Variable risk control via stochastic optimization, 2013, Int. J. Robotics Res.
[12] Eric Walter et al. Interval methods for nonlinear identification and robust control, 2002, Proceedings of the 41st IEEE Conference on Decision and Control.
[13] Martha White et al. Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return, 2018, UAI.
[14] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.
[15] Stergios B. Fotopoulos et al. All of Nonparametric Statistics, 2007, Technometrics.
[16] W. Loh et al. A comparison of tests of equality of variances, 1996.
[17] John A. Nelder et al. The interpretation of negative components of variance, 1954.
[18] K. Hoover et al. Counterfactuals and Causal Structure, 2009.
[19] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[20] Shie Mannor et al. Learning the Variance of the Reward-To-Go, 2016, J. Mach. Learn. Res.
[21] E. S. Pearson. THE ANALYSIS OF VARIANCE IN CASES OF NON-NORMAL VARIATION, 1931.
[22] Ilja Kuzborskij et al. Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting, 2020, ArXiv.
[23] J. Shao. Bootstrap estimation of the asymptotic variances of statistical functionals, 1990.
[24] Marc G. Bellemare et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.
[25] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.
[26] Paul W. Mielke et al. Negative Variance Estimates and Statistical Dependence in Nested Sampling, 1968.
[27] H. Levene. Robust tests for equality of variances, 1961.
[28] Joaquin Quiñonero Candela et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.
[29] V. Chernozhukov et al. Inference on Counterfactual Distributions, 2009, arXiv:0904.0951.
[30] Guohua Pan et al. On a Levene type test for equality of two variances, 1999.
[31] Egon S. Pearson et al. THE DISTRIBUTION OF FREQUENCY CONSTANTS IN SMALL SAMPLES FROM NON-NORMAL SYMMETRICAL AND SKEW POPULATIONS, 1929.
[32] Mateu Sbert et al. Multiple importance sampling revisited: breaking the bounds, 2018, EURASIP J. Adv. Signal Process.
[33] A. García-Pérez. Chi-Square Tests Under Models Close to the Normal Distribution, 2006.
[34] Shie Mannor et al. Policy Gradients with Variance Related Risk Criteria, 2012, ICML.
[35] Meysam Bastani et al. Model-Free Intelligent Diabetes Management Using Machine Learning, 2014.
[36] Illtyd Trethowan. Causality, 1938.
[37] G. Box. NON-NORMALITY AND TESTS ON VARIANCES, 1953.
[38] Blaise Melly et al. Estimation of counterfactual distributions using quantile regression, 2006.
[39] J. Wooldridge. Introduction to Econometrics, 2013.
[40] Ilya Kostrikov et al. Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation, 2020, ArXiv.
[41] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[42] T. Schaul et al. Conditional Importance Sampling for Off-Policy Learning, 2019, AISTATS.
[43] B. Efron et al. Bootstrap confidence intervals, 1996.
[44] M. Kenward et al. An Introduction to the Bootstrap, 2007.
[45] Makoto Sato et al. TD algorithm for the variance of return and mean-variance reinforcement learning, 2001.
[46] Barbara G. Tabachnick et al. Experimental designs using ANOVA, 2006.
[47] M. J. Sobel. The variance of discounted Markov decision processes, 1982.
[48] R. L. Anderson. Negative Variance Estimates, 1965.
[49] C. Cobelli et al. The UVA/PADOVA Type 1 Diabetes Simulator, 2014, Journal of Diabetes Science and Technology.
[50] Marcello Restelli et al. Optimistic Policy Optimization via Multiple Importance Sampling, 2019, ICML.
[51] Massimiliano Pontil et al. Empirical Bernstein Bounds and Sample-Variance Penalization, 2009, COLT.
[52] Fred Spiring et al. Introduction to Statistical Quality Control, 2007, Technometrics.
[53] Martha White et al. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning, 2016, AAMAS.
[54] Philip S. Thomas et al. Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation, 2017, NIPS.
[55] Marcello Restelli et al. Importance Sampling Techniques for Policy Optimization, 2020, J. Mach. Learn. Res.
[56] Peter Stone et al. Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation, 2016, AAAI.