TD algorithm for the variance of return and mean-variance reinforcement learning

Estimating probability distributions over returns enables a variety of sophisticated decision-making schemes for control problems in Markov environments, including risk-sensitive control and efficient exploration. Most reinforcement learning algorithms, however, rely solely on the expected return. This paper provides a decision-making scheme based on the mean and variance of return distributions. It presents a TD algorithm for estimating the variance of return in Markov decision process (MDP) environments, together with a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and the validity of the criterion for risk-avoiding sequential decision tasks.
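As a concrete illustration of the two ingredients named above, the sketch below shows a tabular TD(0) step that tracks both the mean return V(s) and the second moment M(s) of the return; the variance then follows as Var(s) = M(s) - V(s)^2, and a variance-penalized criterion scores states by J(s) = V(s) - kappa * Var(s). This is a minimal sketch built on the standard second-moment Bellman equation, not necessarily this paper's exact algorithm; the function name and the parameters alpha, gamma, and kappa are illustrative assumptions.

    def td_mean_variance_update(V, M, s, r, s_next, alpha=0.1, gamma=0.95):
        """One tabular TD(0) step updating the mean return V[s] and the
        second moment of return M[s]; returns the variance estimate
        M[s] - V[s]**2.  Uses the second-moment Bellman equation
            M(s) = E[ r^2 + 2*gamma*r*V(s') + gamma^2 * M(s') ].
        """
        # Cache successor statistics before updating, in case s == s_next.
        v_next = V[s_next]
        m_next = M[s_next]

        # Standard TD(0) update for the expected return.
        V[s] += alpha * (r + gamma * v_next - V[s])

        # TD-style update for the second moment of the return.
        target_m = r ** 2 + 2.0 * gamma * r * v_next + gamma ** 2 * m_next
        M[s] += alpha * (target_m - M[s])

        # Variance estimate derived from the two tracked statistics.
        return M[s] - V[s] ** 2

    # Illustrative use on a hypothetical two-state chain:
    V = {0: 0.0, 1: 0.0}
    M = {0: 0.0, 1: 0.0}
    var_estimate = td_mean_variance_update(V, M, s=0, r=1.0, s_next=1)
    # A risk-avoiding agent would then rank alternatives by the
    # variance-penalized score J = V - kappa * (M - V**2), kappa >= 0.

Tracking the second moment rather than the variance directly keeps both updates ordinary fixed-point TD iterations; the variance itself does not satisfy a linear Bellman equation, which is one reason this mean/second-moment decomposition is the common construction.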
