The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

We study the problem of temporal-difference-based policy evaluation in reinforcement learning. In particular, we analyse the use of a distributional reinforcement learning algorithm, quantile temporal-difference learning (QTD), for this task. We reach the surprising conclusion that, even in the tabular setting, and even when a practitioner has no interest in the return distribution beyond its mean, QTD (which learns predictions about the full distribution of returns) may outperform approaches such as classical TD learning, which predict only the mean return.
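
To make the comparison concrete, below is a minimal sketch of the two update rules on a single observed transition (s, r, s'). The TD update moves a scalar value estimate toward a bootstrapped target; the QTD update maintains m quantile estimates per state and nudges each one up or down depending on the sign of the error, in the style of quantile regression. All names here (td_update, qtd_update, alpha, and so on) are illustrative choices, not notation from the paper, and the mean-return estimate is simply read off as the average of the learned quantiles.

```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Classical tabular TD(0): move V[s] toward the bootstrapped target."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

def qtd_update(theta, s, r, s_next, alpha=0.1, gamma=0.99):
    """Quantile TD: theta[s] holds m quantile estimates of the return from s.

    Quantile i sits at level tau_i = (2i + 1) / (2m) and is nudged up or down
    by a fixed step depending on how many bootstrapped targets exceed it
    (the standard quantile-regression sign update).
    """
    m = theta.shape[1]
    taus = (np.arange(m) + 0.5) / m
    targets = r + gamma * theta[s_next]  # one target per next-state quantile
    for i in range(m):
        # fraction of targets strictly below the current quantile estimate
        below = np.mean(targets < theta[s, i])
        theta[s, i] += alpha * (taus[i] - below)

# Usage on a toy 5-state problem; the QTD mean-return estimate is the
# per-state average of the quantile estimates.
V = np.zeros(5)            # TD: one scalar value per state
theta = np.zeros((5, 8))   # QTD: 8 quantile estimates per state
td_update(V, s=0, r=1.0, s_next=1)
qtd_update(theta, s=0, r=1.0, s_next=1)
V_qtd = theta.mean(axis=1)
```

One design point worth noting: each QTD increment is bounded by alpha regardless of how large the sampled target is, whereas the TD increment scales with the magnitude of the error. This bounded, sign-based update is a plausible source of QTD's statistical advantage when rewards are noisy or heavy-tailed, even when only the mean return is ultimately of interest.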
