FaulTM: Error detection and recovery using Hardware Transactional Memory

Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multi processors (CMP) provides high reliability since it can detect both transient and permanent faults. Additionally, it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system and recovering from the detected errors might lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery schema utilizing Hardware Transactional Memory (HTM) in order to reduce these performance degradations. We show how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X for SPEC2006 benchmark.

[1]  Mateo Valero Cortés,et al.  FaulTM: Fault-Tolerance Using Hardware Transactional Memory , 2010 .

[2]  Mateo Valero,et al.  SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[3]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[4]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[5]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[6]  Robert Baumann,et al.  Soft errors in advanced computer systems , 2005, IEEE Design & Test of Computers.

[7]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.

[8]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[9]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[10]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[11]  Michael Gschwind,et al.  The IBM Blue Gene/Q Compute Chip , 2012, IEEE Micro.

[12]  Eduard Ayguadé,et al.  Transactional Memory: An Overview , 2007, IEEE Micro.

[13]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[14]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[15]  Kunle Olukotun,et al.  Transactional memory coherence and consistency , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[16]  Seetharami R. Seelam,et al.  Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).

[17]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[18]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[19]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[20]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[21]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  Antonio Rubio,et al.  An approach to crosstalk effect analysis and avoidance techniques in digital CMOS VLSI circuits , 1988 .

[23]  Dan Grossman,et al.  ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.