Using lightweight checkpoint/recovery to improve the availability and designability of shared memory multiprocessors

To address downward trends in availability and designability, we propose using a lightweight checkpoint/recovery scheme called SafetyNet. SafetyNet is a hardware-only scheme that allows a shared memory multiprocessor to recover its system-wide state (including processor registers, caches, and memories) to a previous checkpoint. Thus, when an error occurs because of a device fault or a design fault, SafetyNet allows the system to recover to a pre-error state and re-execute. SafetyNet has three distinguishing features that enable it to provide error-free performance that is statistically equivalent to that of an unprotected system. First, it coordinates the system-wide checkpoints in logical time and leverages “logically atomic” cache coherence transactions. Second, SafetyNet uses an optimized logging scheme to reduce the amount of checkpoint state. Third, it pipelines checkpoint validation (the process of determining that a checkpoint is error-free and can be made the new recovery point) and keeps it entirely in the background. We demonstrate that SafetyNet can be used in conjunction with a variety of existing error detection schemes to improve system availability. We also use SafetyNet to innovate in the areas of availability and designability. To improve availability, we leverage SafetyNet's ability to tolerate long error detection latencies. SafetyNet can tolerate latencies that are long enough to enable much stronger error detection techniques than are currently feasible. These techniques can use inter-node communication and system-wide invariant checking. To improve designability, we use SafetyNet to enable speculatively correct designs, as well as to tolerate certain classes of unintentional design faults. For rare and complicated system events, we demonstrate that we can fall back on SafetyNet (and treat these events as errors) instead of devoting design time and verification effort towards handling them.
We evaluate SafetyNet with full-system simulation and commercial workloads. Our results show that SafetyNet has negligible impact on error-free performance, while avoiding data corruption and system crashes when errors occur. We show that SafetyNet can provide this error recovery with reasonable storage costs and with negligible additional cache bandwidth.
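
The interplay of logging, checkpoint validation, and recovery described above can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions, not the SafetyNet hardware design: the class `CheckpointedMemory`, its methods, and the choice to log a location's old value on its first write within each checkpoint interval are hypothetical names and simplifications introduced here for illustration.

```python
# Toy software model of undo-log checkpointing (illustrative only; all names
# are hypothetical and this is not the SafetyNet hardware mechanism).

_ABSENT = object()  # marks "address did not exist before this interval"


class CheckpointedMemory:
    def __init__(self):
        self.data = {}            # address -> current value
        self.current = 0          # index of the open checkpoint interval
        self.recovery_point = 0   # latest validated checkpoint
        # logs[k] maps address -> value held at checkpoint k (start of interval k)
        self.logs = {0: {}}

    def write(self, addr, value):
        log = self.logs[self.current]
        if addr not in log:
            # Assumption: log only the first write per interval, so repeated
            # writes to the same address add no further checkpoint state.
            log[addr] = self.data.get(addr, _ABSENT)
        self.data[addr] = value

    def take_checkpoint(self):
        # Close the current interval and open a new one.
        self.current += 1
        self.logs[self.current] = {}

    def validate(self, checkpoint):
        # Declare `checkpoint` error-free: it becomes the recovery point and
        # the logs of all earlier intervals can be discarded.
        if not self.recovery_point <= checkpoint <= self.current:
            raise ValueError("checkpoint out of range")
        for k in range(self.recovery_point, checkpoint):
            del self.logs[k]
        self.recovery_point = checkpoint

    def recover(self):
        # Undo intervals from newest to oldest, back to the recovery point.
        for k in range(self.current, self.recovery_point - 1, -1):
            for addr, old in self.logs[k].items():
                if old is _ABSENT:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.current = self.recovery_point
        self.logs = {self.recovery_point: {}}


mem = CheckpointedMemory()
mem.write("a", 1)
mem.take_checkpoint()   # checkpoint 1 holds a=1
mem.write("a", 2)
mem.write("b", 5)
mem.take_checkpoint()   # checkpoint 2 holds a=2, b=5
mem.validate(1)         # only checkpoint 1 has been validated so far
mem.write("a", 3)
mem.recover()           # an error is detected: roll back to checkpoint 1
```

In this model, validation lags behind execution (checkpoint 2 exists but is not yet validated when the error is detected), which loosely mirrors the idea that a long error detection latency can be tolerated as long as the logs for unvalidated checkpoints are retained.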