Trading Performance for Stability in Markov Decision Processes

We study the complexity of central controller synthesis problems for finite-state Markov decision processes, where the objective is to optimize both the expected mean-payoff performance of the system and its stability. We argue that the basic theoretical notion of expressing stability in terms of the variance of the mean-payoff (called global variance in our paper) is not always sufficient, since it ignores possible instabilities on individual runs. For this reason we propose alternative definitions of stability, which we call local and hybrid variance, and which express how rewards on each run deviate from the run's own mean-payoff and from the expected mean-payoff, respectively. We show that a strategy ensuring both the expected mean-payoff and the variance below given bounds requires randomization and memory, under all the above semantics of variance. We then look at the problem of determining whether there is such a strategy. For the global variance, we show that the problem is in PSPACE, and that the answer can be approximated in pseudo-polynomial time. For the hybrid variance, the analogous decision problem is in NP, and a polynomial-time approximation algorithm also exists. For the local variance, we show that the decision problem is in NP. Since the overall performance can be traded for stability (and vice versa), we also present algorithms for approximating the associated Pareto curve in all three cases. Finally, we study a special case of the decision problems, where we require a given expected mean-payoff together with zero variance. Here we show that the problems can all be solved in polynomial time.
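
To make the three notions concrete, the following is a minimal sketch of the definitions (the notation, and in particular the use of $\liminf$ to handle runs whose averages do not converge, is assumed here rather than quoted from the paper). For an infinite run $\omega$ with reward $r_i$ incurred at step $i$:

\[
\mathrm{mp}(\omega) = \liminf_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} r_i, \qquad
\mathrm{lv}(\omega) = \liminf_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \bigl(r_i - \mathrm{mp}(\omega)\bigr)^2, \qquad
\mathrm{hv}(\omega) = \liminf_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \bigl(r_i - \mathbb{E}[\mathrm{mp}]\bigr)^2.
\]

The global variance is then $\mathbb{V}[\mathrm{mp}] = \mathbb{E}[\mathrm{mp}^2] - \mathbb{E}[\mathrm{mp}]^2$, while the local and hybrid measures are the expectations $\mathbb{E}[\mathrm{lv}]$ and $\mathbb{E}[\mathrm{hv}]$. A run alternating rewards $0, 2, 0, 2, \ldots$ illustrates the difference: its mean-payoff is $1$, so a strategy all of whose runs look like this has zero global variance, yet $\mathrm{lv} = \mathrm{hv} = 1$ on every run; this is precisely the per-run instability that the global variance ignores.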
