Software exploitation of a fault-tolerant computer with a large memory

The DM/6000 hardware (a prototype fault-tolerant RS/6000 built at the T.J. Watson Research Center) provides fault tolerance and a large, nonvolatile main memory. Running a commercial, general-purpose operating system on it does nothing, by itself, to increase software availability. In fact, the time needed to rebuild the contents of a large memory may decrease availability. We describe our techniques for hiding most of the main memory from the operating system, so that the operating system can access it only through services that are separate from the operating system itself. This can allow the memory and those access services to achieve much higher availability, which in turn increases the availability of the system as a whole. We also performed simulation studies to determine the conditions under which this system organization can improve performance for recoverable database applications.
