FaulTM: Fault-Tolerance Using Hardware Transactional Memory

Fault tolerance has become an essential concern for processor designers due to increasing soft-error rates. This study is motivated by the observation that Transactional Memory (TM) hardware provides an ideal base on which to build a fault-tolerant system. We show how to provide low-cost fault tolerance for serial programs by using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. This scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique. On the software side, the FaulTM programming model gives programmers the flexibility to trade off performance against reliability. Our experimental results indicate that FaulTM incurs relatively low performance overhead by reducing the number of comparisons and by leveraging previously proposed TM hardware. We also conduct experiments which indicate that the baseline FaulTM design has good error coverage. To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory.
