Mean-Variance Optimization in Markov Decision Processes

We consider finite-horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or history-based policies can improve performance. We prove that computing a policy that maximizes the mean reward under a variance constraint is NP-hard in some cases and strongly NP-hard in others. Finally, we offer pseudopolynomial exact and approximation algorithms.
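To make the performance measure concrete, the sketch below estimates the mean and the variance of the cumulative reward of a fixed policy in a small finite-horizon MDP by Monte Carlo simulation. The two-state MDP, its transition probabilities, rewards, and the always-take-action-0 policy are all illustrative assumptions, not taken from the paper; the paper's hardness results and pseudopolynomial algorithms concern optimizing over policies, which this sketch does not attempt.

```python
import random

# Illustrative 2-state, 2-action MDP (all numbers are made-up assumptions).
# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 0.8), (0, 0.2)]},
     1: {0: [(1, 1.0)],           1: [(0, 0.5), (1, 0.5)]}}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.5}}

def sample_next(dist):
    """Draw a next state from a list of (state, probability) pairs."""
    u, acc = random.random(), 0.0
    for s, p in dist:
        acc += p
        if u < acc:
            return s
    return dist[-1][0]

def episode_return(policy, horizon, start=0):
    """Cumulative reward of one episode under a (time, state) -> action policy."""
    s, total = start, 0.0
    for t in range(horizon):
        a = policy(t, s)
        total += R[s][a]
        s = sample_next(P[s][a])
    return total

def mean_variance(policy, horizon, n_episodes=10_000):
    """Sample mean and variance of the cumulative reward."""
    returns = [episode_return(policy, horizon) for _ in range(n_episodes)]
    mean = sum(returns) / n_episodes
    var = sum((r - mean) ** 2 for r in returns) / n_episodes
    return mean, var

if __name__ == "__main__":
    random.seed(0)
    m, v = mean_variance(lambda t, s: 0, horizon=5)
    print(f"mean={m:.3f}  variance={v:.3f}")
```

A variance-constrained optimizer would search over (possibly randomized or history-dependent) policies for the highest mean subject to a cap on this variance; the paper shows that search is computationally hard in general.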
