SWICH: A Prototype for Efficient Cache-Level Checkpointing and Rollback

Existing cache-level checkpointing schemes do not continuously support a large rollback window. Immediately after a checkpoint, the number of instructions that the processor can undo falls to zero. To address this problem, we introduce Swich, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length.
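
The effect of keeping two live checkpoints can be illustrated with a toy model. The sketch below is not the Swich hardware design; it assumes checkpoints are taken at a fixed instruction interval and that the oldest live checkpoint is discarded at the moment a new one is taken. The names `rollback_window`, `window_stats`, `interval`, and `live_checkpoints` are hypothetical and chosen only for this illustration.

```python
def rollback_window(t, interval, live_checkpoints):
    """Instructions that can be undone at instruction count t.

    Toy model (an assumption, not the Swich implementation):
    checkpoints are taken at t = 0, interval, 2*interval, ...
    and only the newest `live_checkpoints` of them are kept.
    """
    if interval <= 0 or live_checkpoints < 1:
        raise ValueError("interval must be > 0 and live_checkpoints >= 1")
    latest = t // interval                          # index of the newest checkpoint
    oldest = max(0, latest - (live_checkpoints - 1))  # index of the oldest live one
    return t - oldest * interval


def window_stats(interval, live_checkpoints, periods=10):
    """Minimum and mean window length in steady state (after warm-up)."""
    start = interval * live_checkpoints             # skip the warm-up phase
    samples = [
        rollback_window(t, interval, live_checkpoints)
        for t in range(start, start + interval * periods)
    ]
    return min(samples), sum(samples) / len(samples)


if __name__ == "__main__":
    for k in (1, 2):
        lo, mean = window_stats(interval=1000, live_checkpoints=k)
        print(f"{k} live checkpoint(s): min window = {lo}, mean window = {mean}")
```

In this model, a single live checkpoint gives a window that drops to zero right after each checkpoint, while two live checkpoints keep the window at no less than one full interval.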
