REDAC: Distributed, Asynchronous Redundancy in Shared Memory Servers

The emergence of multi-core architectures—driven by continued technology scaling—has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs consist of multiple high-speed clock domains, conventional lockstepped redundant execution is no longer practical. Recent work suggests an asynchronous approach to redundant execution, where processor pairs independently execute an instruction stream and treat any differences like soft errors, invoking rollback recovery. Because prior designs buffer instruction results within the out-of-order instruction window, they are limited to tightly coupled redundancy within a single chip, which limits availability and serviceability in the presence of hard errors. We propose REDAC, a set of lightweight mechanisms for distributed, asynchronous redundancy within a sharedmemory multiprocessor. REDAC provides scalable buffering for unchecked state updates, permitting the distribution of redundant execution across multiple nodes of a scalable shared-memory server. The REDAC mechanisms achieve high performance by enabling speculation across common serializing instructions and mitigating the effects of input incoherence. We evaluate REDAC using cycle-accurate fullsystem simulation of common enterprise workloads and show that performance overheads average just 10% when compared to a non-redundant system. These results are comparable to the performance of a similarly configured lockstep design, but offer the substantial benefits of asynchronous redundancy.

[1]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[2]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[3]  Sule Ozev,et al.  A mechanism for online diagnosis of hard faults in microprocessors , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[4]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[5]  P.N. Sanda,et al.  IBM z990 soft error detection and recovery , 2005, IEEE Transactions on Device and Materials Reliability.

[6]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.

[7]  Charles F. Webb IBM z10: The Next-Generation Mainframe Microprocessor , 2008, IEEE Micro.

[8]  Neeraj Suri,et al.  Designing high-performance and reliable superscalar architectures-the out of order reliable superscalar (O3RS) approach , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[9]  Eric Rotenberg,et al.  A study of slipstream processors , 2000, MICRO 33.

[10]  Josep Torrellas,et al.  BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.

[11]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[12]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[13]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[14]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, International Symposium on Computer Architecture.

[15]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[16]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[17]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[18]  J.F. Martinez,et al.  Cherry: Checkpointed early resource recycling in out-of-order microprocessors , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[19]  Sarita V. Adve,et al.  The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[20]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[21]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[22]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[23]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[24]  Jared C. Smolens,et al.  Fingerprinting: hash-based error detection in microprocessors , 2007 .

[25]  Babak Falsafi,et al.  TRUSS: a reliable, scalable server architecture , 2005, IEEE Micro.

[26]  Onur Mutlu,et al.  Runahead Execution: An Effective Alternative to Large Instruction Windows , 2003, IEEE Micro.

[27]  Robert S. Swarz,et al.  Reliable Computer Systems: Design and Evaluation , 1992 .

[28]  T. N. Vijaykumar,et al.  Is SC+ILP=RC? , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[29]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[30]  Pradip Bose,et al.  The case for lifetime reliability-aware microprocessors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[31]  James E. Smith,et al.  Configurable isolation: building high availability systems with commodity multi-core processors , 2007, ISCA '07.

[32]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[33]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[34]  Josep Llosa,et al.  Out-of-order commit processors , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[35]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[36]  Josep Torrellas,et al.  ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[37]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[38]  Thomas F. Wenisch,et al.  Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.

[39]  Charles Webb IBM z6 - the next-generation mainframe microprocessor , 2007, 2007 IEEE Hot Chips 19 Symposium (HCS).

[40]  Haitham Akkary,et al.  Checkpoint processing and recovery: towards scalable large instruction window processors , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[41]  Jose Renau,et al.  CAVA: Using checkpoint-assisted value prediction to hide L2 misses , 2006, TACO.

[42]  Pat Conway,et al.  The AMD Opteron Processor for Multiprocessor Servers , 2003, IEEE Micro.

[43]  Babak Falsafi,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[44]  Babak Falsafi,et al.  DBmbench: fast and accurate database workload representation on modern microarchitecture , 2005, CASCON.

[45]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, 2008 International Symposium on Computer Architecture.

[46]  Kanad Ghose,et al.  Increasing processor performance through early register release , 2004, IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings..