Redundancy Schemes for High Availability Computer Clusters

The primary goal of computer clusters is to improve computing performance by exploiting the parallelism they intrinsically provide. Their redundant hardware components also enable them to offer highly available services. In this paper, we present an analytical model for analyzing redundancy schemes and their impact on the cluster’s overall performance. We analyze several cluster redundancy techniques, with an emphasis on hardware and data redundancy, and derive from them an applicable redundancy scheme design. Our solution also provides a disaster recovery mechanism that improves the cluster’s availability. For data redundancy, we present improvements to the replication and parity-based techniques, and we investigate the availability of the cluster under several scenarios that take into account, among other things, the number of replicated nodes, the number of CPUs that hold parity data, and the relation between primary and replicated data. For this purpose, we developed a simulator that analyzes the impact of a redundancy scheme on the processing rate of the cluster. We also studied the performance of two well-known schemes as a function of the usage rate of the CPUs. We found that two important aspects influencing the performance of a transaction-oriented cluster are its failover scheme and its data redundancy scheme. Among the data redundancy schemes we simulated, data replication offered higher cluster availability than the parity model.
