Availability modeling and analysis on high performance cluster computing systems

Cluster computing has attracted growing attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. Availability is a key system attribute: it needs to be considered at the system design stage, and the modeled value must reflect the system's actual behavior. System monitoring and logging make it possible to identify unplanned events, so that the model reflects the actual system's availability. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis, and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster at Lawrence Livermore National Laboratory.
