Improving the Efficiency of Markov Chain Analysis of Complex Distributed Systems

In large-scale distributed computing systems, the interactions of many independent components may lead to emergent global system behaviors with unforeseen, often detrimental, outcomes. The increasing economic importance of distributed systems such as cloud computing systems, grid computing systems, and the Internet argues for developing analytical tools to understand and predict complex system behavior in order to ensure the availability and reliability of computing services. In previous work, we described one such tool, in which a piecewise homogeneous discrete-time Markov chain representation of a grid computing system can be systematically perturbed to predict situations that lead to marked performance degradation and system-wide failure. While the run times of the Markov chain model compared favorably with those of testbeds or detailed large-scale simulations, it was still often necessary to execute a sizable number of alternative perturbations of the model to identify scenarios in which system performance is likely to degrade or in which anomalous behaviors may occur. Here, we extend our original approach and describe two novel methods for more quickly identifying portions of the Markov chain that are likely to be sensitive to perturbation. The first method involves finding cut sets, that is, sets of state transitions whose removal disconnects all paths in a Markov chain from the initial state to a desired end state. We show that by perturbing the state transitions in a cut set, it is possible to more quickly identify scenarios in which system performance is adversely affected. We also show that this method can be applied to larger Markov models than those in our earlier work and therefore scales better. We then present a second method, in which the Spectral Expansion Theorem is used to analyze the eigensystem of a set of Markov transition probability matrices (TPMs) in order to identify eigenvectors and eigenvalues that can be used to predict system performance. We describe how this second approach can also be used to indicate which state transitions, if perturbed, are likely to adversely affect system performance. Results are presented for both methods to show that they identify the same failure scenarios presented in our earlier paper (as well as additional scenarios, using the first method), while reducing the number of perturbations of the Markov model (or eliminating Markov simulation altogether, using the second method). We believe that these methods provide a basis for creating practical tools for the analysis of complex systems, and we discuss future work toward this end.
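
To make the cut-set idea concrete, the following is a minimal sketch and not the algorithm used in the paper. It treats every transition with nonzero probability as a unit-capacity edge and finds a minimum-cardinality cut between an initial state and an end state with a textbook breadth-first max-flow search. The state names, the probabilities in `CHAIN`, and the `perturb` helper (which shifts probability mass from a cut transition onto the source state's self-loop) are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy chain: state -> {next_state: transition probability}.
CHAIN = {
    "initial":     {"initial": 0.1, "discovering": 0.9},
    "discovering": {"discovering": 0.1, "negotiating": 0.6, "waiting": 0.3},
    "waiting":     {"waiting": 0.5, "discovering": 0.5},
    "negotiating": {"monitoring": 0.7, "discovering": 0.2, "failed": 0.1},
    "monitoring":  {"monitoring": 0.1, "negotiating": 0.1, "completed": 0.8},
    "completed":   {"completed": 1.0},
    "failed":      {"failed": 1.0},
}


def min_cut_transitions(transitions, source, sink):
    """Return a minimum-cardinality set of transitions (u, v) whose removal
    disconnects every path from source to sink."""
    # Residual graph: each non-self transition gets unit capacity, plus a
    # reverse entry (capacity 0 unless the reverse transition also exists).
    residual = {source: {}, sink: {}}
    for u, row in transitions.items():
        for v, p in row.items():
            if p > 0 and u != v:
                residual.setdefault(u, {})[v] = 1
                residual.setdefault(v, {}).setdefault(u, 0)

    def reachable_parents():
        # Breadth-first search over edges that still have residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    parents = reachable_parents()
    while sink in parents:
        # Push one unit of flow back along the augmenting path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= 1
            residual[v][u] += 1
            v = u
        parents = reachable_parents()

    # Transitions leaving the source side of the final residual graph.
    side = set(parents)
    return sorted(
        (u, v)
        for u, row in transitions.items()
        for v, p in row.items()
        if p > 0 and u in side and v not in side
    )


def perturb(transitions, cut, scale):
    """Scale each cut transition by `scale`, moving the removed probability
    onto the self-loop so that every row still sums to 1 (an assumption)."""
    perturbed = {u: dict(row) for u, row in transitions.items()}
    for u, v in cut:
        removed = perturbed[u][v] * (1.0 - scale)
        perturbed[u][v] -= removed
        perturbed[u][u] = perturbed[u].get(u, 0.0) + removed
    return perturbed


cut = min_cut_transitions(CHAIN, "initial", "completed")
weakened = perturb(CHAIN, cut, scale=0.5)
```

In this toy chain, every path to `completed` passes through the single transition from `initial` to `discovering`, so perturbing that one transition is enough to probe how sensitive task completion is. A weighted variant could use the transition probabilities as edge capacities instead of unit capacities.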
