TD algorithm for the variance of return and mean-variance reinforcement learning

Estimating probability distributions over returns enables a variety of sophisticated decision-making schemes for control problems in Markov environments, including risk-sensitive control and efficient exploration. Most reinforcement learning algorithms, however, rely solely on the expected return. This paper provides a decision-making scheme based on the mean and variance of return distributions. It presents a TD algorithm for estimating the variance of return in Markov decision process (MDP) environments, together with a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and the validity of the criterion for risk-avoiding sequential decision tasks.
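As a concrete illustration of the two ingredients named above, the sketch below shows a tabular TD(0) step that tracks both the mean return V(s) and the second moment M(s) of the return; the variance then follows as Var(s) = M(s) - V(s)^2, and a variance-penalized criterion scores states by J(s) = V(s) - kappa * Var(s). This is a minimal sketch built on the standard second-moment Bellman equation, not necessarily this paper's exact algorithm; the function name and the parameters alpha, gamma, and kappa are illustrative assumptions.

    def td_mean_variance_update(V, M, s, r, s_next, alpha=0.1, gamma=0.95):
        """One tabular TD(0) step updating the mean return V[s] and the
        second moment of return M[s]; returns the variance estimate
        M[s] - V[s]**2.  Uses the second-moment Bellman equation
            M(s) = E[ r^2 + 2*gamma*r*V(s') + gamma^2 * M(s') ].
        """
        # Cache successor statistics before updating, in case s == s_next.
        v_next = V[s_next]
        m_next = M[s_next]

        # Standard TD(0) update for the expected return.
        V[s] += alpha * (r + gamma * v_next - V[s])

        # TD-style update for the second moment of the return.
        target_m = r ** 2 + 2.0 * gamma * r * v_next + gamma ** 2 * m_next
        M[s] += alpha * (target_m - M[s])

        # Variance estimate derived from the two tracked statistics.
        return M[s] - V[s] ** 2

    # Illustrative use on a hypothetical two-state chain:
    V = {0: 0.0, 1: 0.0}
    M = {0: 0.0, 1: 0.0}
    var_estimate = td_mean_variance_update(V, M, s=0, r=1.0, s_next=1)
    # A risk-avoiding agent would then rank alternatives by the
    # variance-penalized score J = V - kappa * (M - V**2), kappa >= 0.

Tracking the second moment rather than the variance directly keeps both updates ordinary fixed-point TD iterations; the variance itself does not satisfy a linear Bellman equation, which is one reason this mean/second-moment decomposition is the common construction.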
