Implicit Quantile Networks for Distributional Reinforcement Learning

In this work, we build on recent advances in distributional reinforcement learning to give a generally applicable, flexible, and state-of-the-art distributional variant of DQN. We achieve this by using quantile regression to approximate the full quantile function of the state-action return distribution. Reparameterizing a base distribution over the sample space yields an implicitly defined return distribution and gives rise to a large class of risk-sensitive policies. We demonstrate improved performance on the 57 Atari 2600 games in the ALE, and use our algorithm's implicitly defined distributions to study the effects of risk-sensitive policies in Atari games.
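As a rough illustration of the idea described above (a minimal sketch, not the authors' implementation), the code below shows an implicit quantile head trained with a quantile Huber regression loss: quantile fractions τ are sampled uniformly, embedded with a cosine basis, combined multiplicatively with the state features, and mapped to per-action quantile values. It assumes PyTorch, and all names (`IQNHead`, `quantile_huber_loss`, `embedding_dim`, ...) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an implicit quantile network head
# trained with the quantile Huber regression loss. Assumes PyTorch; all names
# below (IQNHead, quantile_huber_loss, embedding_dim, ...) are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class IQNHead(nn.Module):
    def __init__(self, feature_dim: int, num_actions: int, embedding_dim: int = 64):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.tau_embed = nn.Linear(embedding_dim, feature_dim)  # cosine-basis embedding of tau
        self.out = nn.Linear(feature_dim, num_actions)          # per-action quantile values

    def forward(self, state_features: torch.Tensor, num_taus: int = 8):
        batch = state_features.shape[0]
        # Sample quantile fractions tau ~ U(0, 1): the reparameterized sample space.
        taus = torch.rand(batch, num_taus, 1, device=state_features.device)
        # Cosine embedding of each tau, followed by a ReLU-activated linear layer.
        i = torch.arange(1, self.embedding_dim + 1, device=taus.device, dtype=taus.dtype)
        tau_features = F.relu(self.tau_embed(torch.cos(math.pi * i * taus)))
        # Multiplicative interaction with the state features, then quantile values.
        quantiles = self.out(state_features.unsqueeze(1) * tau_features)
        return quantiles, taus.squeeze(-1)  # (batch, num_taus, num_actions), (batch, num_taus)


def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Quantile regression loss with a Huber penalty.

    pred:   (batch, N)  predicted quantile values for the taken action
    target: (batch, N') Bellman-target return samples (treated as fixed)
    taus:   (batch, N)  fractions at which `pred` was evaluated
    """
    td = target.unsqueeze(1) - pred.unsqueeze(2)                    # pairwise TD errors (batch, N, N')
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.unsqueeze(2) - (td.detach() < 0).float()).abs()  # |tau - 1{td < 0}|
    return (weight * huber / kappa).sum(dim=1).mean()
```

Under the same assumptions, sampling the fractions from a distorted distribution instead of U(0, 1) is what would give rise to the risk-sensitive policies mentioned above, since the resulting action values weight different parts of the return distribution unevenly.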
