Distributional reinforcement learning with linear function approximation

Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is the analysis by Rowland et al. (2018) of the C51 algorithm in terms of the Cramér distance, but their results apply only to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramér distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramér-based and can be combined with linear function approximation, with formal guarantees in the context of policy evaluation. By allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramér-based distributional methods may perform worse than directly approximating the value function.
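
For concreteness, here is a minimal NumPy sketch of the quantity at the heart of the paper: the squared Cramér distance between two discrete distributions on a shared, evenly spaced support, written so that it remains well defined when the inputs are arbitrary real vectors. The function name, the `delta_z` bin-width argument, and the specific vector extension are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def cramer_distance_sq(p, q, delta_z=1.0):
    """Squared Cramér distance between two vectors on a shared support.

    For probability vectors p and q, np.cumsum gives their CDFs and this
    is the standard discrete Cramér distance. The same expression still
    evaluates when p and q are arbitrary real vectors, mirroring the
    paper's move away from normalized distributions (hypothetical
    sketch, not the authors' exact formulation).
    """
    diff = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    return delta_z * float(np.sum(diff ** 2))

# Two categorical distributions on the same 5-atom support.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.0, 0.3, 0.3, 0.3, 0.1])
print(cramer_distance_sq(p, q))  # ≈ 0.02

# An unnormalized, partly negative prediction: the loss is still defined,
# at the cost of the probabilistic interpretation.
v = np.array([0.2, -0.1, 0.5, 0.3, 0.1])
print(cramer_distance_sq(v, q))  # ≈ 0.08
```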

[1] Matteo Hessel, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[2] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[3] Will Dabney, et al. Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.

[4] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[5] Will Dabney, et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.

[6] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[7] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[8] Peter Auer, et al. Exponentially many local minima for single neurons, 1995, NIPS.

[9] Volodymyr Mnih, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[10] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[11] Christopher J. C. H. Watkins. Learning from delayed rewards, 1989, PhD thesis, University of Cambridge.

[12] Pablo Samuel Castro, et al. Dopamine: A Research Framework for Deep Reinforcement Learning, 2018, ArXiv.

[13] Mark Rowland, et al. An Analysis of Categorical Distributional Reinforcement Learning, 2018, AISTATS.

[14] Marc G. Bellemare, et al. The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017, ArXiv.

[15] Richard S. Sutton. Learning to predict by the methods of temporal differences, 1988, Machine Learning.