Paper Information - Nonparametric Return Distribution Approximation for Reinforcement Learning

Nonparametric Return Distribution Approximation for Reinforcement Learning

Standard reinforcement learning (RL) aims to optimize decision-making rules in terms of the expected return. However, other criteria, such as the expected shortfall, are sometimes preferred, especially for risk management. Here, we describe a method for approximating the distribution of returns, from which various kinds of information about the returns can be derived. We first show that the Bellman equation, a recursive formula for the expected return, can be extended to the cumulative return distribution. We then derive a nonparametric return distribution estimator with particle smoothing based on this extended Bellman equation. A key aspect of the proposed algorithm is that it represents the recursion in the extended Bellman equation by a simple replacement procedure: particles associated with a state are replaced using those of its successor state. We show that our algorithm leads to a risk-sensitive RL paradigm. The usefulness of the proposed approach is demonstrated through numerical experiments.
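The particle replacement idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm: it omits the smoothing step, and the two-state chain, the discount factor `gamma`, and the names `update_particles`, `expected_shortfall`, `toy_step`, and `run` are all assumptions made for illustration. Each state holds a set of particles (sampled returns). After an observed transition, one particle of the current state is overwritten with the reward plus the discounted value of a particle drawn from the successor state.

```python
import random


def update_particles(particles, s, r, s_next, gamma, rng, terminal=False):
    """Overwrite one random particle of state s with r + gamma * successor particle."""
    # At a terminal transition there is no successor return to bootstrap from.
    successor = 0.0 if terminal else rng.choice(particles[s_next])
    i = rng.randrange(len(particles[s]))
    particles[s][i] = r + gamma * successor


def expected_shortfall(samples, alpha):
    """Mean of the worst alpha-fraction of the samples (lower tail of the return)."""
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def toy_step(s, rng):
    """Hypothetical chain: state 0 moves to state 1 with reward 0;
    state 1 terminates with reward +1 or -1, each with probability 0.5."""
    if s == 0:
        return 0.0, 1, False
    reward = 1.0 if rng.random() < 0.5 else -1.0
    return reward, None, True


def run(num_particles=200, sweeps=20000, gamma=0.9, seed=0):
    """Run the replacement procedure on the toy chain and return the particle sets."""
    rng = random.Random(seed)
    particles = {0: [0.0] * num_particles, 1: [0.0] * num_particles}
    for _ in range(sweeps):
        s = rng.choice([0, 1])  # visit states uniformly, for simplicity
        r, s_next, done = toy_step(s, rng)
        update_particles(particles, s, r, s_next, gamma, rng, terminal=done)
    return particles


if __name__ == "__main__":
    particles = run()
    mean_return = sum(particles[0]) / len(particles[0])
    print("approximate mean return from state 0:", mean_return)
    print("expected shortfall (alpha=0.1):", expected_shortfall(particles[0], 0.1))
```

Because the particles approximate the whole return distribution rather than only its mean, a risk criterion such as the expected shortfall can be read directly off the same particle set, which is what makes a risk-sensitive choice of criterion possible in this sketch.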
