Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Let H := (S0, A0, R0, S1, . . . , St−1, At−1, Rt−1, St) be the first t transitions in the episode H . We call H a partial trajectory of length t. Notice that we use subscripts on trajectories to denote the trajectory’s index in D and superscripts to denote partial trajectories—H i is the first t transitions of the ith trajectory in D. Let H be the set of all possible partial trajectories of length t.