Supporting highly-decoupled thread-level redundancy for parallel programs

The continued scaling of device dimensions and operating voltage reduces the critical charge, and thus the natural noise tolerance, of transistors. As a result, circuits can produce transient upsets that corrupt program execution and data. Redundant execution can detect and correct such circuit errors on the fly. The increasing prevalence of multi-core architectures makes coarse-grain thread-level redundancy (TLR) very attractive. While TLR has been extensively studied in the context of single-threaded applications, much less attention has been paid to the design issues and tradeoffs of supporting parallel codes. In this paper, we propose a microarchitecture to efficiently support TLR for parallel codes. One of the main design goals is to support a large number of unverified instructions, so that long verification latencies can be easily tolerated. Another important objective is comprehensive coverage that includes not only the computation logic but also the coherence and consistency logic in the memory subsystem. Hence, the redundant copy of the program needs to access memory independently, and the system needs to efficiently manage the non-determinism in parallel execution. The proposed architectural support to achieve these goals is entirely off the processor critical path and can be easily disabled when redundancy is not requested. The design, with a few effective optimizations, is also efficient: during error-free execution, it causes less than 3% additional performance degradation on top of the throughput loss due to redundancy.
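The idea of decoupled redundant execution can be illustrated with a small software analogy. The sketch below is not the proposed microarchitecture: it is a single-threaded Python model in which a leading copy runs ahead and a trailing copy re-executes each step later, with up to `buffer_size` results left unverified at any time. The names `run_redundant`, `buffer_size`, `fault_at`, and `square` are all hypothetical and chosen only for illustration, and the bit flip is an assumed stand-in for a transient upset.

```python
from collections import deque


def run_redundant(program, inputs, buffer_size=4, fault_at=None):
    """Run `program` on every input twice; the second run verifies lazily.

    Returns the index of the first result that fails verification,
    or None if every result matches.
    """
    pending = deque()  # leading-copy results that are not yet verified

    def verify_oldest():
        index, value, leading_result = pending.popleft()
        trailing_result = program(value)  # independent re-execution
        return index if trailing_result != leading_result else None

    for index, value in enumerate(inputs):
        result = program(value)
        if index == fault_at:
            result ^= 1  # assumed fault model: flip the lowest bit
        pending.append((index, value, result))
        # The leading copy waits for verification only when the
        # buffer of unverified work is full.
        if len(pending) > buffer_size:
            bad = verify_oldest()
            if bad is not None:
                return bad

    # Drain whatever is still unverified at the end.
    while pending:
        bad = verify_oldest()
        if bad is not None:
            return bad
    return None


def square(x):
    return x * x
```

A larger `buffer_size` lets the leading copy run further ahead before it must wait, which mirrors the stated goal of tolerating long verification latencies; the cost is that an error may be detected later and more unverified state must be held.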
