Implicit Quantile Networks for Distributional Reinforcement Learning

In this work, we build on recent advances in distributional reinforcement learning to give a generally applicable, flexible, and state-of-the-art distributional variant of DQN. We achieve this by using quantile regression to approximate the full quantile function of the state-action return distribution. Reparameterizing a base distribution over the sample space yields an implicitly defined return distribution and gives rise to a large class of risk-sensitive policies. We demonstrate improved performance on the 57 Atari 2600 games in the ALE, and use our algorithm's implicitly defined distributions to study the effects of risk-sensitive policies in Atari games.
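As a rough illustration of the idea described above (a minimal sketch, not the authors' implementation), the code below shows an implicit quantile head trained with a quantile Huber regression loss: quantile fractions τ are sampled uniformly, embedded with a cosine basis, combined multiplicatively with the state features, and mapped to per-action quantile values. It assumes PyTorch, and all names (`IQNHead`, `quantile_huber_loss`, `embedding_dim`, ...) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an implicit quantile network head
# trained with the quantile Huber regression loss. Assumes PyTorch; all names
# below (IQNHead, quantile_huber_loss, embedding_dim, ...) are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class IQNHead(nn.Module):
    def __init__(self, feature_dim: int, num_actions: int, embedding_dim: int = 64):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.tau_embed = nn.Linear(embedding_dim, feature_dim)  # cosine-basis embedding of tau
        self.out = nn.Linear(feature_dim, num_actions)          # per-action quantile values

    def forward(self, state_features: torch.Tensor, num_taus: int = 8):
        batch = state_features.shape[0]
        # Sample quantile fractions tau ~ U(0, 1): the reparameterized sample space.
        taus = torch.rand(batch, num_taus, 1, device=state_features.device)
        # Cosine embedding of each tau, followed by a ReLU-activated linear layer.
        i = torch.arange(1, self.embedding_dim + 1, device=taus.device, dtype=taus.dtype)
        tau_features = F.relu(self.tau_embed(torch.cos(math.pi * i * taus)))
        # Multiplicative interaction with the state features, then quantile values.
        quantiles = self.out(state_features.unsqueeze(1) * tau_features)
        return quantiles, taus.squeeze(-1)  # (batch, num_taus, num_actions), (batch, num_taus)


def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Quantile regression loss with a Huber penalty.

    pred:   (batch, N)  predicted quantile values for the taken action
    target: (batch, N') Bellman-target return samples (treated as fixed)
    taus:   (batch, N)  fractions at which `pred` was evaluated
    """
    td = target.unsqueeze(1) - pred.unsqueeze(2)                    # pairwise TD errors (batch, N, N')
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.unsqueeze(2) - (td.detach() < 0).float()).abs()  # |tau - 1{td < 0}|
    return (weight * huber / kappa).sum(dim=1).mean()
```

Under the same assumptions, sampling the fractions from a distorted distribution instead of U(0, 1) is what would give rise to the risk-sensitive policies mentioned above, since the resulting action values weight different parts of the return distribution unevenly.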
