Distributional Reinforcement Learning with Monotonic Splines

Distributional Reinforcement Learning (RL) differs from traditional RL by estimating the full distribution over returns, thereby capturing the intrinsic uncertainty of MDPs. A key challenge in distributional RL is how to parameterize the quantile function when minimizing the Wasserstein metric of temporal differences. Existing algorithms represent it with step functions or piecewise linear functions. In this paper, we propose to learn smooth, continuous quantile functions represented by monotonic rational-quadratic splines, a parameterization that also naturally solves the quantile crossing problem. Experiments in stochastic environments show that dense estimation of quantile functions improves distributional RL, yielding faster empirical convergence and higher rewards in most cases.
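As a minimal sketch of the core building block, the snippet below evaluates a monotone rational-quadratic spline of the Gregory-Delbourgo form (the same interpolant used in neural spline flows): within each bin the interpolant is a ratio of two quadratics, and it is strictly increasing whenever the knot values increase and the knot derivatives are positive, which is what rules out quantile crossing. The knot arrays `xk`, `yk`, `dk` and the function name are illustrative assumptions, not the paper's actual implementation; in the paper these quantities would be produced by a network, whereas here they are fixed inputs.

```python
import numpy as np

def rq_spline(x, xk, yk, dk):
    """Monotone rational-quadratic spline (Gregory-Delbourgo form).

    x  : points to evaluate, inside [xk[0], xk[-1]]
    xk : increasing knot locations (e.g. quantile levels in [0, 1])
    yk : increasing knot values (e.g. returns); strictly increasing
    dk : positive derivatives at the knots

    With increasing yk and positive dk the interpolant is strictly
    increasing, so the resulting quantile function cannot cross itself.
    """
    x = np.asarray(x, dtype=float)
    # locate the bin [xk[k], xk[k+1]] containing each x
    k = np.clip(np.searchsorted(xk, x) - 1, 0, len(xk) - 2)
    w = xk[k + 1] - xk[k]             # bin width
    s = (yk[k + 1] - yk[k]) / w       # secant slope of the bin
    xi = (x - xk[k]) / w              # position within the bin, in [0, 1]
    num = (yk[k + 1] - yk[k]) * (s * xi**2 + dk[k] * xi * (1 - xi))
    den = s + (dk[k + 1] + dk[k] - 2 * s) * xi * (1 - xi)
    return yk[k] + num / den
```

The interpolant reproduces the knot values exactly and, unlike a step or piecewise-linear quantile function, is continuously differentiable across the interior knots when the derivatives `dk` are shared between adjacent bins.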
