Paper Information - Incrementally Learning Functions of the Return

Incrementally Learning Functions of the Return

Temporal difference (TD) methods enable efficient, incremental estimation of value functions in reinforcement learning, and are of broader interest because they correspond to learning as observed in biological systems. Standard value functions correspond to the expected value of a sum of discounted rewards, that is, the expected return. While this formulation is sufficient for many purposes, it would often be useful to represent functions of the return as well. Unfortunately, most such functions cannot be estimated directly using TD methods. We propose a means of estimating functions of the return using its moments, which can be learned online using a modified TD algorithm. The moments of the return are then used as part of a Taylor expansion to approximate analytic functions of the return.
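The idea in the abstract can be illustrated with a minimal tabular sketch. It is not the paper's algorithm: it assumes the standard recursion for the second moment of the return, obtained by squaring G = r + γG', and a second-order Taylor expansion taken about the mean. The function names (`td_moment_update`, `taylor_expectation`, `run_demo`) and the two-state chain with coin-flip rewards are hypothetical, chosen only for illustration.

```python
import random


def td_moment_update(v, m2, s, r, s_next, gamma, alpha):
    """One TD step for the first two moments of the return at state s.

    v[s] estimates E[G | s] and m2[s] estimates E[G^2 | s].
    A terminal successor is passed as s_next=None (both moments are 0 there).
    """
    v_next = 0.0 if s_next is None else v[s_next]
    m2_next = 0.0 if s_next is None else m2[s_next]
    # G = r + gamma*G'  =>  G^2 = r^2 + 2*gamma*r*G' + gamma^2*G'^2
    target_v = r + gamma * v_next
    target_m2 = r * r + 2.0 * gamma * r * v_next + gamma ** 2 * m2_next
    v[s] += alpha * (target_v - v[s])
    m2[s] += alpha * (target_m2 - m2[s])


def taylor_expectation(f, d2f, mean, second_moment):
    """Second-order Taylor approximation of E[f(G)], expanded about the mean."""
    variance = second_moment - mean ** 2
    return f(mean) + 0.5 * d2f(mean) * variance


def run_demo(episodes=20000, gamma=0.9, alpha=0.005, seed=0):
    """Learn both moments online on a hypothetical chain 0 -> 1 -> terminal."""
    rng = random.Random(seed)
    v, m2 = [0.0, 0.0], [0.0, 0.0]
    for _ in range(episodes):
        for s, s_next in ((0, 1), (1, None)):
            r = float(rng.random() < 0.5)  # assumed reward: fair coin flip in {0, 1}
            td_moment_update(v, m2, s, r, s_next, gamma, alpha)
    return v, m2
```

After `v, m2 = run_demo()`, a call such as `taylor_expectation(math.exp, math.exp, v[0], m2[0])` (with `import math`) would give a rough estimate of E[exp(G)] from the start state. For f(x) = x², the second-order expansion is exact and simply returns the learned second moment. Higher-order terms would require learning higher moments in the same way.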

Vincent Liu | Muhammad Zaheer | Brendan Bennett | Wesley Chung
