Bias-Variance Error Bounds for Temporal Difference Updates

Temporal difference (TD) algorithms are used in reinforcement learning to compute estimates of the value of a given policy in an unknown Markov decision process (policy evaluation). We give rigorous upper bounds on the error of the closely related phased TD algorithms (which differ from the standard updates in their treatment of the learning rate) as a function of the amount of experience. These upper bounds prove exponentially fast convergence, with both the rate of convergence and the asymptote strongly dependent on the length of the backups k or the parameter λ. Our bounds give formal verification to the well-known intuition that TD methods are subject to a bias-variance tradeoff, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters. We give preliminary experimental confirmation of our theory for a version of the random walk problem.
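
To make the phased-update idea concrete, the following is a minimal sketch of phased TD(k)-style policy evaluation on a simple symmetric random-walk chain. It is not the paper's experimental setup: the chain size, discount factor, number of trajectories per phase, and names such as `phased_td_k` and `random_walk_step` are illustrative assumptions. Each phase averages n k-step returns per state and bootstraps from the previous phase's estimates, which is what gives rise to the bias (from bootstrapping) versus variance (from longer sampled returns) tradeoff in k.

```python
# Illustrative sketch only: a phased TD(k)-style update on an assumed
# 5-state symmetric random walk with absorbing ends. Parameters and
# helper names are hypothetical, not taken from the paper.

import numpy as np

GAMMA = 0.9        # discount factor (assumed)
N_STATES = 5       # non-terminal states 0..4; terminals just outside (assumed)

def random_walk_step(s, rng):
    """One step of the symmetric random walk; reward 1 only on
    absorbing into the right terminal state."""
    s_next = s + (1 if rng.random() < 0.5 else -1)
    if s_next == N_STATES:      # right terminal
        return None, 1.0
    if s_next < 0:              # left terminal
        return None, 0.0
    return s_next, 0.0

def phased_td_k(k, n_traj=50, n_phases=30, seed=0):
    """Each phase averages n_traj k-step returns per state,
    bootstrapping from the previous phase's value estimates."""
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)
    for _ in range(n_phases):
        V_new = np.zeros(N_STATES)
        for s in range(N_STATES):
            total = 0.0
            for _ in range(n_traj):
                state, ret, discount = s, 0.0, 1.0
                for _ in range(k):
                    state, r = random_walk_step(state, rng)
                    ret += discount * r
                    discount *= GAMMA
                    if state is None:       # episode absorbed early
                        break
                if state is not None:       # bootstrap from previous phase
                    ret += discount * V[state]
                total += ret
            V_new[s] = total / n_traj
        V = V_new
    return V

if __name__ == "__main__":
    # Small k leans on (possibly biased) bootstrapped estimates;
    # large k relies on noisier sampled returns.
    for k in (1, 3, 10):
        print(k, np.round(phased_td_k(k), 3))
```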