Cost Reduction in High Power Computing Using a Deferred Repair Strategy: A Simulation Study

Fault-tolerant systems that use a repair-upon-failure strategy can become expensive in terms of labour and time. For homogeneous multi-server systems in particular, if no control hierarchy exists, postponing non-essential repairs can reduce these costs without significantly affecting the availability of the whole system. While repairs are postponed, however, the system must remain capable of dealing with user requests. For this purpose, a threshold value is usually defined that represents the minimum number of servers the system administrator should keep operative. Because such systems keep operating in a degraded state after failures, evaluating their performability, that is, performance and availability considered together, is important. In this paper, the simulation of large-scale multi-server systems with identical servers serving a stream of arriving jobs is considered. The cost of running such systems under various deferred repair strategies is calculated and compared with the cost of using a repair-upon-failure strategy.
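
The threshold idea can be illustrated with a minimal sketch. It is not the simulation model used in the paper: it ignores the job stream entirely and assumes exponentially distributed failure and repair times, a single repair crew that fixes one server at a time, and a crew that stays on site until every server is operative again. The function name `simulate` and all parameter names and values are hypothetical. Setting `threshold` to `n_servers - 1` corresponds to repair-upon-failure, and lower values defer repairs.

```python
import random

def simulate(n_servers, threshold, fail_rate, repair_rate, horizon, seed=0):
    """Toy deferred-repair model (illustrative assumptions only).

    The crew is called out when the number of operative servers drops to
    `threshold` or below, and leaves once all servers are operative.
    Returns (number of call-outs, time-averaged operative servers).
    """
    if not 0 <= threshold < n_servers:
        raise ValueError("threshold must satisfy 0 <= threshold < n_servers")
    rng = random.Random(seed)
    t = 0.0
    operative = n_servers
    crew_on_site = False
    callouts = 0
    operative_time = 0.0  # integral of operative servers over time

    while True:
        f = operative * fail_rate                 # total failure rate
        r = repair_rate if crew_on_site else 0.0  # repair rate (one crew)
        total = f + r
        dt = rng.expovariate(total)               # time to next event
        if t + dt >= horizon:
            operative_time += operative * (horizon - t)
            break
        operative_time += operative * dt
        t += dt
        if rng.random() < f / total:
            # A server fails; call the crew only if the threshold is reached.
            operative -= 1
            if not crew_on_site and operative <= threshold:
                crew_on_site = True
                callouts += 1
        else:
            # A repair completes; the crew leaves when nothing is broken.
            operative += 1
            if operative == n_servers:
                crew_on_site = False

    return callouts, operative_time / horizon

# Hypothetical parameters: 10 servers, rare failures, fast repairs.
immediate = simulate(10, threshold=9, fail_rate=0.01, repair_rate=1.0, horizon=10_000)
deferred = simulate(10, threshold=7, fail_rate=0.01, repair_rate=1.0, horizon=10_000)
print("repair-upon-failure:", immediate)
print("deferred (threshold 7):", deferred)
```

Multiplying the call-out count by an assumed cost per call-out gives a rough labour cost for each strategy, which can then be weighed against the loss in average operative capacity.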
