Fast cluster failover using virtual memory-mapped communication

This paper proposes a novel way to use virtual memory-mapped communication (VMMC) to reduce failover time on clusters. With the VMMC model, an application's virtual address space can be efficiently mirrored in remote memory, either automatically or via explicit messages. When a machine fails, its applications can restart on the failover node from their most recent checkpoints with minimal memory copying and disk I/O overhead. This method requires little change to the applications' source code. We developed two fast failover protocols: the deliberate update failover protocol (DU) and the automatic update failover protocol (AU). The first can run on any system that supports VMMC, whereas the second requires special network interface support. We implemented the two protocols on two different clusters that support VMMC. Our results with three transaction-based applications show that both protocols work well. The deliberate update protocol imposes 4-21% overhead when taking checkpoints every 2 seconds. If an application can tolerate 20% overhead, this protocol can fail over to another machine within 4 milliseconds in the best case and within 0.1 to 3 seconds in the worst case. Failover performance can be further improved by using special network interface hardware. The automatic update protocol can take checkpoints every 0.1 seconds with only 3-12% overhead. If 10% overhead is allowed, it can fail over applications within 0.01 to 0.4 seconds in the worst case.
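The checkpoint-and-mirror idea behind the deliberate update protocol can be illustrated with a small in-process simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Mirror` and `Primary` classes, the `fail_over` function, the page size, and the dirty-page tracking scheme are all hypothetical names and choices introduced here, and an ordinary method call stands in for the explicit VMMC transfer to remote memory.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class Mirror:
    """Stands in for the remote memory on the failover node (hypothetical)."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.checkpoint_id = 0

    def apply(self, checkpoint_id, dirty_pages):
        # Install every page of the checkpoint, then record its id.
        for page_no, data in dirty_pages.items():
            self.pages[page_no] = data
        self.checkpoint_id = checkpoint_id


class Primary:
    """Application side: tracks dirty pages and pushes them at checkpoints."""

    def __init__(self, num_pages, mirror):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()
        self.mirror = mirror
        self.checkpoint_id = 0

    def write(self, page_no, offset, data):
        self.pages[page_no][offset:offset + len(data)] = data
        self.dirty.add(page_no)

    def checkpoint(self):
        # Deliberate update: explicitly send only the pages modified
        # since the previous checkpoint.
        self.checkpoint_id += 1
        snapshot = {n: bytes(self.pages[n]) for n in self.dirty}
        self.mirror.apply(self.checkpoint_id, snapshot)
        self.dirty.clear()
        return len(snapshot)


def fail_over(mirror):
    """Restart from the most recent checkpoint held in the mirror."""
    return [bytearray(p) for p in mirror.pages], mirror.checkpoint_id


if __name__ == "__main__":
    mirror = Mirror(num_pages=4)
    primary = Primary(num_pages=4, mirror=mirror)
    primary.write(1, 0, b"committed")
    primary.checkpoint()
    primary.write(2, 0, b"lost")  # never checkpointed
    pages, checkpoint_id = fail_over(mirror)
    print(checkpoint_id, bytes(pages[1][:9]), bytes(pages[2][:4]))
```

In this sketch, work done after the last checkpoint is lost on failover, which is why the checkpoint interval bounds the worst-case recovery work. An automatic update variant would propagate each write to the mirror as it happens instead of batching at checkpoint time; the abstract notes that this requires special network interface support.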
