Paper information - Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services

Cluster-based servers can substantially increase performance when nodes cooperate to globally manage resources. However, in this paper we show that cooperation results in a substantial availability loss, in the absence of high-availability mechanisms. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, increases service unavailability by a factor of 10 over a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other interesting observations, we show that the application of multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.
