Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services

Cluster-based servers can substantially increase performance when nodes cooperate to globally manage resources. However, in this paper we show that cooperation results in a substantial availability loss, in the absence of high-availability mechanisms. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, increases service unavailability by a factor of 10 over a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other interesting observations, we show that the application of multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.

[1]  Darrell D. E. Long,et al.  A study of the reliability of Internet sites , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[2]  David E. Culler,et al.  Distributed data structures for internet service construction , 2000, USENIX Symposium on Operating Systems Design and Implementation.

[3]  Ricardo Bianchini,et al.  Efficiency vs. portability in cluster-based network servers , 2001, PPoPP '01.

[4]  Erich M. Nahum,et al.  Locality-aware request distribution in cluster-based network servers , 1998, ASPLOS VIII.

[5]  Richard P. Martin,et al.  Mendosis: A SAN-based Fault Injection Test-bed for the Construction of Highly Available Network Services , 2001 .

[6]  H KatzRandy,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988 .

[7]  David E. Culler,et al.  Scalable, distributed data structures for internet service construction , 2000, OSDI.

[8]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[9]  Henrique Madeira,et al.  Joint evaluation of performance and robustness of a COTS DBMS through fault-injection , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[10]  Ravishankar K. Iyer,et al.  Faults, symptoms, and software fault tolerance in the Tandem GUARDIAN90 operating system , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[11]  Ricardo Bianchini,et al.  Analytical and experimental evaluation of cluster-based network servers , 2000, World Wide Web.

[12]  Eric A. Brewer,et al.  Harvest, yield, and scalable tolerant systems , 1999, Proceedings of the Seventh Workshop on Hot Topics in Operating Systems.

[13]  David A. Patterson,et al.  Reducing the cost of system administration of a disk storage system built from commodity components , 2000 .

[14]  Mark Sullivan,et al.  Software defects and their impact on system availability-a study of field failures in operating systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[15]  Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 15-21 November 2003, Phoenix, AZ, USA, CD-Rom , 2003 .

[16]  Richard P. Martin,et al.  Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.

[17]  Ravishankar K. Iyer,et al.  An approach towards benchmarking of fault-tolerant commercial systems , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[18]  David E. Culler,et al.  SEDA: an architecture for well-conditioned, scalable internet services , 2001, SOSP.

[19]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[20]  Ravishankar K. Iyer,et al.  Chameleon: A Software Infrastructure for Adaptive Fault Tolerance , 1999, IEEE Trans. Parallel Distributed Syst..

[21]  Eric A. Brewer,et al.  Lessons from Giant-Scale Services , 2001, IEEE Internet Comput..

[22]  Ravishankar K. Iyer,et al.  Failure data analysis of a LAN of Windows NT based computers , 1999, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.

[23]  Kishor S. Trivedi,et al.  An approach for estimation of software aging in a Web server , 2002, Proceedings International Symposium on Empirical Software Engineering.

[24]  Brendan Murphy,et al.  Windows 2000 Dependability , 2000 .

[25]  Liviu Iftode,et al.  User-level communication in cluster-based servers , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[26]  Richard P. Martin,et al.  Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services , 2003, USENIX Symposium on Internet Technologies and Systems.

[27]  Jim Gray,et al.  A census of Tandem system availability between 1985 and 1990 , 1990 .

[28]  Ravishankar K. Iyer,et al.  Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment , 2000, IEEE Trans. Knowl. Data Eng..

[29]  Thu D. Nguyen,et al.  Us-ing Fault Model Enforcement to Improve Availability , 2002 .

[30]  Tarek El-Ghazawi,et al.  Redundant array of inexpensive disks (RAID) , 2003 .

[31]  Frank B. Schmuck,et al.  Agreeing on Processor Group Membership in Timed Asynchronous Distributed Systems , 1995 .

[32]  M LevyHenry,et al.  Manageability, availability and performance in Porcupine , 1999 .