Reliability and performance of tree-structured grid services

Grid computing is an emerging technology that aims at large-scale resource sharing and global-area collaboration. It is the next step in the evolution of parallel and distributed computing. Because of the large scale and complexity of grid systems, their performance and reliability are difficult to model, analyse, and evaluate. This paper presents a model that relaxes some assumptions made in existing work on distributed systems that are unsuitable for grid computing systems. The paper proposes a virtual tree model of the grid service. This model simplifies the physical structure of a grid service, allows service performance (execution time) to be estimated, and takes into account common cause failures in communication channels. Based on the model, an algorithm for evaluating the grid service performance distribution and the service reliability indices is suggested. The algorithm is based on graph theory and Bayesian analysis. Illustrative examples are presented in which the results of the suggested algorithm are compared with simulation results.
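
To make the idea of a common cause failure in a tree-structured service concrete, the following is a minimal Monte Carlo sketch in Python. It is not the algorithm suggested in the paper (which is analytical, based on graph theory and Bayesian analysis). It is only the kind of simulation such results could be checked against. The tree layout, the link failure rates, the subtask names, and the processing times are all hypothetical, and the sketch assumes exponentially distributed link failure times and perfectly reliable resources.

```python
import random

# Hypothetical virtual tree: child node -> (parent node, failure rate of the link to the parent).
# Node 0 is the root, standing in for the component that dispatches subtasks.
LINKS = {1: (0, 0.001), 2: (0, 0.002), 3: (1, 0.001), 4: (1, 0.003), 5: (2, 0.002)}

# Hypothetical leaf resources: node -> (subtask, time to process it and return the result).
RESOURCES = {3: ("A", 40.0), 4: ("A", 55.0), 5: ("B", 50.0)}


def path_to_root(node):
    """Return the links (named by their child node) between a node and the root."""
    path = []
    while node in LINKS:
        path.append(node)
        node = LINKS[node][0]
    return path


def simulate_once(rng):
    """Return the service time for one sampled scenario, or None if the service fails."""
    # One failure time is drawn per link. Resources that share a link see the same
    # value, so a single link failure can disable several resources at once.
    fail_at = {child: rng.expovariate(rate) for child, (_, rate) in LINKS.items()}
    best = {}
    for node, (subtask, duration) in RESOURCES.items():
        # A resource delivers its result only if every link on its path outlives the job.
        if all(fail_at[link] > duration for link in path_to_root(node)):
            best[subtask] = min(duration, best.get(subtask, float("inf")))
    subtasks = {subtask for subtask, _ in RESOURCES.values()}
    if set(best) != subtasks:
        return None  # at least one subtask was completed by no resource
    # The service finishes when its slowest subtask finishes.
    return max(best.values())


def estimate(runs=20000, seed=0):
    """Estimate service reliability and mean execution time given success."""
    rng = random.Random(seed)
    times = [t for t in (simulate_once(rng) for _ in range(runs)) if t is not None]
    reliability = len(times) / runs
    mean_time = sum(times) / len(times) if times else float("nan")
    return reliability, mean_time


if __name__ == "__main__":
    print(estimate())
```

In this sketch, resources 3 and 4 both execute subtask A and both depend on link 1, so a failure of that link removes the redundancy in one stroke. Treating the two paths as independent would overstate the reliability, which is the effect a common cause failure model is meant to capture.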
