Predicting Transient Downtime in Virtual Server Systems: An Efficient Sample Path Randomization Approach

A central challenge in developing Service Level Agreements for cloud datacenters is estimating the downtime distribution of a set of provisioned servers over a service window. Three facts compound this challenge. First, although steady-state probabilities have been derived for birth-death processes involving server failures and repairs, they can be highly inaccurate under transience, and steady state cannot be assured within typical service windows; estimating transient distributions is therefore essential. Second, failure and repair processes may follow any distribution, so they need to be extracted from system log data and modeled using appropriate general distributions. Third, downtime distributions over service windows depend on the number of servers and their deployment structure for a contract. We develop an efficient and generalized sample path randomization approach to precisely estimate transient probabilities under three different checkpointing strategies and three flexible failure distribution models. The estimators are unbiased, consistent, efficient and sufficient, and their asymptotic convergence is established. The estimation algorithms are computationally efficient in solving practical problems and yield rich information on transient system behaviors. The methodology is general and extensible to various server failure and repair processes characterized using birth-death modeling.
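To make the quantity being estimated concrete, the following is a minimal sketch of a plain Monte Carlo sample-path simulation for a single server. It is not the randomization approach developed in the paper, and it ignores checkpointing and multi-server deployment structure. The Weibull time-to-failure, the exponential repair time, and all function and parameter names are assumptions chosen for illustration only.

```python
import random


def simulate_downtime(window, shape, scale, mean_repair, rng):
    """Total downtime of one server over [0, window] along one sample path.

    Assumed model (illustrative only): the server starts up, times to
    failure are Weibull(scale, shape), repair times are exponential.
    """
    t = 0.0      # current time on the sample path
    down = 0.0   # accumulated downtime so far
    while True:
        # Up period: draw the time until the next failure.
        t += rng.weibullvariate(scale, shape)
        if t >= window:
            return down
        # Down period: draw a repair time, truncated at the window end.
        repair = rng.expovariate(1.0 / mean_repair)
        down += min(repair, window - t)
        t += repair
        if t >= window:
            return down


def downtime_exceedance(threshold, n_paths, window, shape, scale,
                        mean_repair, seed=0):
    """Fraction of simulated paths whose downtime exceeds `threshold`."""
    rng = random.Random(seed)
    hits = sum(
        simulate_downtime(window, shape, scale, mean_repair, rng) > threshold
        for _ in range(n_paths)
    )
    return hits / n_paths
```

Calling, for example, `downtime_exceedance(threshold=2.0, n_paths=10000, window=720.0, shape=1.5, scale=500.0, mean_repair=4.0)` returns an empirical estimate of the probability that downtime over a 720-hour window exceeds 2 hours under these hypothetical parameters. Because the window is finite and the server starts in the up state, this estimate reflects transient rather than steady-state behavior.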
