High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Many sequential decision-making systems leverage data collected using prior policies to propose a new policy. For critical applications, it is important that high-confidence guarantees on the new policy's behavior are provided before deployment, to ensure that the policy will behave as desired. Prior works have studied high-confidence off-policy estimation of the expected return; however, high-confidence off-policy estimation of the variance of returns can be equally critical for high-risk applications. In this paper we tackle the previously open problem of estimating and bounding, with high confidence, the variance of returns from off-policy data.

Introduction

Reinforcement learning (RL) has emerged as a promising method for solving sequential decision-making problems (Sutton and Barto 2018). Deploying RL to real-world applications, however, requires additional consideration of reliability, which has been relatively understudied. Specifically, it is often desirable to provide high-confidence guarantees on the behavior of a given policy, before deployment, to ensure that the policy will behave as desired. Prior works in RL have studied the problem of providing high-confidence guarantees on the expected return of an evaluation policy, π, using only data collected from a currently deployed policy called the behavior policy, β (Thomas, Theocharous, and Ghavamzadeh 2015; Hanna, Stone, and Niekum 2017; Kuzborskij et al. 2020). Analogously, researchers have studied the problem of counterfactually estimating and bounding the average treatment effect, with high confidence, using data from past treatments (Bottou et al. 2013).

While these methods present important contributions towards developing practical algorithms, real-world problems may require additional consideration of the variance of returns (effect) under any new policy (treatment) before it can be deployed responsibly. For applications with high stakes in terms of financial cost or public well-being, providing guarantees on the mean outcome alone might not be sufficient. Analysis of variance (ANOVA) has therefore become a de facto standard for many industrial and medical applications (Tabachnick and Fidell 2007). Similarly, analysis of variance can inform numerous real-world applications of RL: for example, (a) analysing the variance of outcomes in a robotics application (Kuindersma, Grupen, and Barto 2013), (b) ensuring that the variance of outcomes for a medical treatment is not high, (c) characterizing the variance of customer experiences for a recommendation system (Teevan et al. 2009), or (d) limiting the variability of the performance of an autonomous driving system (Montgomery 2007).

[Figure 1: Illustrative example of the distributions of returns from a behavior policy β and an evaluation policy π, along with the importance-weighted returns ρ, discussed later. Given trajectories from the behavior policy β, we aim to estimate and bound the variance, σ(π), of returns under an evaluation policy π, with high confidence. Note that the distribution of importance-weighted returns ρ has the mean value μ(π), but might have variance not equal to σ(π).]
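To make the caption's final point concrete, consider a small one-step example; the numbers are chosen purely for illustration and are not from the paper. Here ρ is the single-step ratio π(a)/β(a), and the quick check is written in Python:

    # Hypothetical one-step example: pi picks a1 w.p. 0.8 and a2 w.p. 0.2,
    # beta picks each action w.p. 0.5, and the return G is just the immediate reward.
    p_pi   = {"a1": 0.8, "a2": 0.2}
    p_beta = {"a1": 0.5, "a2": 0.5}
    reward = {"a1": 1.0, "a2": 0.0}

    mu_pi  = sum(p_pi[a] * reward[a] for a in reward)                 # 0.8
    var_pi = sum(p_pi[a] * (reward[a] - mu_pi) ** 2 for a in reward)  # 0.16

    rho      = {a: p_pi[a] / p_beta[a] for a in reward}               # importance ratios pi(a)/beta(a)
    mu_rhoG  = sum(p_beta[a] * rho[a] * reward[a] for a in reward)    # 0.8 -- matches mu(pi)
    var_rhoG = sum(p_beta[a] * (rho[a] * reward[a] - mu_rhoG) ** 2
                   for a in reward)                                   # 0.64 -- not sigma(pi) = 0.16

The importance-weighted return has the correct mean (0.8), but its variance under β (0.64) is not the variance of returns under π (0.16); this is precisely the gap the paper sets out to close.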
More generally, variance estimation can be used to account for risk in decision-making by designing objectives that maximize the mean of returns while minimizing the variance of returns (Sato, Kimura, and Kobayashi 2001; Di Castro, Tamar, and Mannor 2012; La and Ghavamzadeh 2013). Variance estimates have also been shown to be useful for automatically adapting hyper-parameters, like the exploration rate (Sakaguchi and Takano 2004) or the λ for eligibility traces (White and White 2016), and might also inform other methods that depend on the entire distribution of returns (Bellemare, Dabney, and Munos 2017; Dabney et al. 2017).

Despite the wide applicability of variance analysis, estimating and bounding the variance of returns with high confidence, using only off-policy data, has remained an understudied problem. In this paper, we first formalize the problem statement, an illustration of which is provided in Figure 1. We show that the typical use of importance sampling (IS) in RL only corrects for the mean, and so it does not directly provide unbiased off-policy estimates of the variance. We then present an off-policy estimator of the variance of returns that uses IS twice, together with a simple double-sampling technique. To reduce the variance of this estimator, we extend the per-decision IS technique (Precup 2000) to off-policy variance estimation. Building upon this estimator, we provide confidence intervals for the variance using (a) concentration inequalities and (b) statistical bootstrapping.

Advantages: The proposed variance estimator has several advantages: (a) it is model-free and can thus be used irrespective of the environment's complexity, (b) it requires only off-policy data and can therefore be used before actual policy deployment, and (c) it is unbiased and consistent. For high-confidence guarantees, (d) we provide both upper and lower confidence intervals for the variance that have guaranteed coverage (that is, they hold at any desired confidence level and without requiring false assumptions), and (e) we also provide bootstrap confidence intervals, which are approximate but often more practical.

Limitations: The proposed off-policy estimator of the variance relies upon IS and thus inherits its limitations. Namely, (a) it requires knowledge of the action probabilities under the behavior policy β, (b) it requires that the support of trajectories under the evaluation policy π is a subset of the support under the behavior policy β, and (c) the variance of the estimator scales exponentially with the length of the trajectory (Guo, Thomas, and Brunskill 2017; Liu et al. 2018).

Background and Problem Statement

A Markov decision process (MDP) is a tuple (S, A, P, R, γ, d0), where S is the set of states, A is the set of actions, P is the transition function, R is the reward function, γ ∈ [0, 1) is the discount factor, and d0 is the starting-state distribution. A policy π is a distribution over actions conditioned on the state, i.e., π(a|s) represents the probability of taking action a in state s. We assume that the MDP has a finite horizon T, after which any action leads to an absorbing state S(∞). In general, we use subscripts with parentheses for the timestep and subscripts without parentheses for the episode number. Let R_i(j) ∈ [R_min, R_max] represent the reward observed at timestep j of episode i, and let the random variable G_i := ∑_{j=0}^{T} γ^j R_i(j) be the return for episode i.
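As a concrete reference for this return notation, a minimal sketch; the helper name and the example numbers are ours, purely for illustration:

    def discounted_return(rewards, gamma):
        # G_i = sum over j of gamma**j * R_i(j), for one episode's rewards R_i(0), ..., R_i(T).
        return sum((gamma ** j) * r for j, r in enumerate(rewards))

    # e.g., discounted_return([1.0, 0.0, 2.0], gamma=0.9) is 1.0 + 0.0 + 0.9**2 * 2.0 ≈ 2.62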
Let c := (1 − γ^{T+1})/(1 − γ), so that the minimum and maximum possible returns are G_min := c R_min and G_max := c R_max, respectively. Let μ(π) := E_π[G] be the expected return and σ(π) := V_π[G] be the variance of returns, where the subscript π denotes that the trajectories are generated using policy π. We formulate the problem in terms of MDPs, but it can analogously be formulated in terms of structural causal models (Pearl 2009). For simplicity, we consider finite states and actions, but our results extend to POMDPs (by replacing states with observations), to continuous states and actions (by appropriately replacing summations with integrals), and to infinite horizons (T := ∞).

Let H^π_(i):(j) be the set of all possible trajectories for a policy π from timestep i to timestep j. Let H denote a complete trajectory, (S(0), A(0), Pr(A(0)|S(0)), R(0), S(1), ..., S(∞)), where T is the horizon length and S(0) is sampled from d0. Let D be a set of n trajectories {H_i}_{i=1}^{n} generated using behavior policies {β_i}_{i=1}^{n}, respectively. Let

    ρ_i(0, T) := ∏_{j=0}^{T} π(A_i(j)|S_i(j)) / β_i(A_i(j)|S_i(j))

denote the product of importance ratios from timestep 0 to T. For brevity, when the range of timesteps is not needed, we write ρ_i := ρ_i(0, T); similarly, when referring to ρ_i for an arbitrary i ∈ {1, ..., n}, we often write ρ. With this notation, we now formalize the off-policy variance estimation (OVE) and the high-confidence off-policy variance estimation (HCOVE) problems.

OVE Problem: Given a set of trajectories D and an evaluation policy π, we aim to find an estimator σ̂_n that is both an unbiased and consistent estimator of σ(π), i.e., E[σ̂_n] = σ(π) and σ̂_n → σ(π) almost surely.

HCOVE Problem: Given a set of trajectories D, an evaluation policy π, and a confidence level 1 − δ, we aim to find a confidence interval C := [v_lb, v_ub] such that Pr(σ(π) ∈ C) ≥ 1 − δ.

Remark 1. It is worth emphasizing that the OVE problem is about estimating the variance of returns, and not the variance of the estimator of the mean of returns.

These problems would not be possible to solve if the trajectories in D were not informative about the trajectories that are possible under π. For example, if D has no trajectory that could be observed if policy π were executed, then D provides little or no information about the possible outcomes under π. To avoid this case, we make the following common assumption (Precup 2000), which is satisfied if (β_i(a|s) = 0) =⇒ (π(a|s) = 0) for all s ∈ S, a ∈ A, and i ∈ {1, ..., n}.

Assumption 1. The set D contains independent trajectories generated using behavior policies {β_i}_{i=1}^{n}, such that ∀i, H^π_(0):(T) ⊆ H^{β_i}_(0):(T).

The methods that we derive, and IS methods in general, do not require complete knowledge of {β_i}_{i=1}^{n} (which might be parameterized using deep neural networks and might be hard to store); only the probabilities β_i(a|s) for the states s and actions a present in D are required. For simplicity, we restrict our notation to a single behavior policy β, such that ∀i, β_i = β.

Naïve Methods

In the on-policy setting, computing an estimate of μ(π) or σ(π) is trivial: sample n trajectories using π and compute the sample mean or variance of the observed returns {G_i}_{i=1}^{n}.
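A minimal sketch of these quantities, under an assumed (hypothetical) data layout in which each trajectory is stored as a list of (s, a, beta_prob, r) tuples carrying the logged behavior probability, and pi(a, s) is a callable returning π(a|s); none of these names come from the paper:

    import numpy as np

    def importance_ratio(trajectory, pi):
        # rho_i(0, T): product over timesteps of pi(A_i(j) | S_i(j)) / beta_i(A_i(j) | S_i(j)),
        # using the behavior probability logged with each step (only these logged
        # probabilities are needed, as noted above).
        rho = 1.0
        for s, a, beta_prob, _ in trajectory:
            rho *= pi(a, s) / beta_prob
        return rho

    def on_policy_mean_and_variance(returns):
        # Naive on-policy estimates: sample mean and sample variance of the returns
        # {G_i} observed by actually running the evaluation policy pi itself.
        G = np.asarray(returns, dtype=float)
        return G.mean(), G.var(ddof=1)

As the one-step example after Figure 1 suggests, simply substituting ρ_i G_i for G_i in the sample-variance formula would not recover σ(π); correcting for this is what the off-policy variance estimator developed in the paper is designed to do.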

References

[1] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[2] Stephen G. Donald et al. Estimation and inference for distribution functions and quantile functions in treatment effect models, 2014.
[3] D. Horvitz et al. A Generalization of Sampling Without Replacement from a Finite Universe, 1952.
[4] Cyrus Derman et al. SOME CONTRIBUTIONS TO THE THEORY OF, 2016.
[5] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.
[6] Mohammad Ghavamzadeh et al. Actor-Critic Algorithms for Risk-Sensitive MDPs, 2013, NIPS.
[7] Shie Mannor et al. Variance Adjusted Actor Critic Algorithms, 2013, ArXiv.
[8] Yutaka Sakaguchi et al. Reliability of internal prediction/estimation and its application. I. Adaptive action selection reflecting reliability of value function, 2004, Neural Networks.
[9] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.
[10] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.
[11] Scott Kuindersma et al. Variable risk control via stochastic optimization, 2013, Int. J. Robotics Res.
[12] Eric Walter et al. Interval methods for nonlinear identification and robust control, 2002, Proceedings of the 41st IEEE Conference on Decision and Control.
[13] Martha White et al. Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return, 2018, UAI.
[14] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.
[15] Stergios B. Fotopoulos et al. All of Nonparametric Statistics, 2007, Technometrics.
[16] W. Loh et al. A comparison of tests of equality of variances, 1996.
[17] John A. Nelder et al. The interpretation of negative components of variance, 1954.
[18] K. Hoover et al. Counterfactuals and Causal Structure, 2009.
[19] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[20] Shie Mannor et al. Learning the Variance of the Reward-To-Go, 2016, J. Mach. Learn. Res.
[21] E. S. Pearson. THE ANALYSIS OF VARIANCE IN CASES OF NON-NORMAL VARIATION, 1931.
[22] Ilja Kuzborskij et al. Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting, 2020, ArXiv.
[23] J. Shao. Bootstrap estimation of the asymptotic variances of statistical functionals, 1990.
[24] Marc G. Bellemare et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.
[25] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.
[26] Paul W. Mielke et al. Negative Variance Estimates and Statistical Dependence in Nested Sampling, 1968.
[27] H. Levene. Robust tests for equality of variances, 1961.
[28] Joaquin Quiñonero Candela et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.
[29] V. Chernozhukov et al. Inference on Counterfactual Distributions, 2009, arXiv:0904.0951.
[30] Guohua Pan et al. On a Levene type test for equality of two variances, 1999.
[31] Egon S. Pearson et al. THE DISTRIBUTION OF FREQUENCY CONSTANTS IN SMALL SAMPLES FROM NON-NORMAL SYMMETRICAL AND SKEW POPULATIONS, 1929.
[32] Mateu Sbert et al. Multiple importance sampling revisited: breaking the bounds, 2018, EURASIP J. Adv. Signal Process.
[33] A. García-Pérez. Chi-Square Tests Under Models Close to the Normal Distribution, 2006.
[34] Shie Mannor et al. Policy Gradients with Variance Related Risk Criteria, 2012, ICML.
[35] Meysam Bastani et al. Model-Free Intelligent Diabetes Management Using Machine Learning, 2014.
[36] Illtyd Trethowan. Causality, 1938.
[37] G. Box. NON-NORMALITY AND TESTS ON VARIANCES, 1953.
[38] Blaise Melly et al. Estimation of counterfactual distributions using quantile regression, 2006.
[39] J. Wooldridge. Introduction to Econometrics, 2013.
[40] Ilya Kostrikov et al. Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation, 2020, ArXiv.
[41] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[42] T. Schaul et al. Conditional Importance Sampling for Off-Policy Learning, 2019, AISTATS.
[43] B. Efron et al. Bootstrap confidence intervals, 1996.
[44] M. Kenward et al. An Introduction to the Bootstrap, 2007.
[45] Makoto Sato et al. TD algorithm for the variance of return and mean-variance reinforcement learning, 2001.
[46] Barbara G. Tabachnick et al. Experimental designs using ANOVA, 2006.
[47] M. J. Sobel. The variance of discounted Markov decision processes, 1982.
[48] R. L. Anderson. Negative Variance Estimates, 1965.
[49] C. Cobelli et al. The UVA/PADOVA Type 1 Diabetes Simulator, 2014, Journal of Diabetes Science and Technology.
[50] Marcello Restelli et al. Optimistic Policy Optimization via Multiple Importance Sampling, 2019, ICML.
[51] Massimiliano Pontil et al. Empirical Bernstein Bounds and Sample-Variance Penalization, 2009, COLT.
[52] Fred Spiring et al. Introduction to Statistical Quality Control, 2007, Technometrics.
[53] Martha White et al. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning, 2016, AAMAS.
[54] Philip S. Thomas et al. Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation, 2017, NIPS.
[55] Marcello Restelli et al. Importance Sampling Techniques for Policy Optimization, 2020, J. Mach. Learn. Res.
[56] Peter Stone et al. Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation, 2016, AAAI.