论文信息 - An Analysis of Categorical Distributional Reinforcement Learning

An Analysis of Categorical Distributional Reinforcement Learning

Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance. This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties of CDRL algorithms are not yet well understood. In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cram\'er distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms.

[1] John N. Tsitsiklis,et al. Analysis of temporal-difference learning with function approximation , 1996, NIPS 1996.

[2] Sean R Eddy,et al. What is dynamic programming? , 2004, Nature Biotechnology.

[3] Marc G. Bellemare,et al. The Cramer Distance as a Solution to Biased Wasserstein Gradients , 2017, ArXiv.

[4] Moshe Shaked,et al. Stochastic orders and their applications , 1994 .

[5] Marc G. Bellemare,et al. A Distributional Perspective on Reinforcement Learning , 2017, ICML.

[6] John N. Tsitsiklis,et al. Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.

[7] H. Kushner,et al. Stochastic Approximation and Recursive Algorithms and Applications , 2003 .

[8] Shie Mannor,et al. Learning the Variance of the Reward-To-Go , 2016, J. Mach. Learn. Res..

[9] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[10] Masashi Sugiyama,et al. Nonparametric Return Distribution Approximation for Reinforcement Learning , 2010, ICML.

[11] J. Norris. Appendix: probability and measure , 1997 .

[12] Patrick Billingsley,et al. Probability and Measure. , 1986 .

[13] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[14] Marc G. Bellemare,et al. Distributional Reinforcement Learning with Quantile Regression , 2017, AAAI.

[15] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[16] Mohammad Ghavamzadeh,et al. Actor-Critic Algorithms for Risk-Sensitive MDPs , 2013, NIPS.

[17] Michael I. Jordan,et al. MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .

[18] Mahesan Niranjan,et al. On-line Q-learning using connectionist systems , 1994 .

[19] Masashi Sugiyama,et al. Parametric Return Density Estimation for Reinforcement Learning , 2010, UAI.