Performability analysis of highly available clusters with break-downs and deferred repairs.

Fault-tolerant systems that follow a repair-upon-failure strategy can become expensive in terms of labour and time. When repairs are postponed instead, it is essential to keep the whole system capable of dealing with user requests. For this purpose, a threshold value is usually defined that represents the minimum number of servers the system administrator should keep operative. Highly available multiprocessor systems with one head node and several computation nodes are a common configuration in cluster systems used as a low-cost alternative to supercomputers. A redundant head node is typically introduced in such systems to improve availability. Deferred repairs can reduce repair costs for such systems when no permanent repair facility exists on the premises. Because such systems are fault tolerant, evaluating their performability is very important. In this paper, performance modelling of highly available multiprocessor systems is presented. The systems considered have one main node and several identical computing nodes serving the same stream of arriving jobs. To improve the availability of the system, the head node is backed up. The models account for delays due to switching of the head node and are solved for exact performability measures, for both bounded and unbounded queuing systems, assuming a deferred repair strategy.
