Elzar: Triple Modular Redundancy using Intel Advanced Vector Extensions (technical report)

Instruction-Level Redundancy (ILR) is a well-known approach to tolerating transient CPU faults. It replicates the instructions of a program and inserts periodic checks to detect and correct CPU faults by majority voting, which requires three copies of each instruction and therefore incurs high performance overhead. Because SIMD technology can operate on several copies of the data simultaneously, it appears to be a good candidate for reducing this overhead. To test this hypothesis, we propose Elzar, a compiler framework that transforms unmodified multithreaded applications to support triple modular redundancy, using Intel AVX extensions for vectorization. Our experience with several benchmark suites and real-world case studies yields mixed results: while SIMD may be beneficial for some workloads, e.g., CPU-intensive ones with many floating-point operations, it exhibits higher overhead than ILR in many of the applications we tested. We study the sources of this overhead and discuss possible improvements to Intel AVX that would lead to better performance.
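
The replicate-then-vote idea described above can be illustrated with a small simulation. This is only a conceptual sketch, not Elzar's implementation: it models the redundant copies as a three-element Python list standing in for SIMD lanes, and the names `replicate`, `lane_add`, `flip_bit`, and `vote` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter

LANES = 3  # assumption: one lane per redundant copy (triple modular redundancy)


def replicate(value):
    """Place the same value in every lane, like broadcasting into a vector."""
    return [value] * LANES


def lane_add(a, b):
    """Apply one operation to all copies at once, as a SIMD add would."""
    return [x + y for x, y in zip(a, b)]


def flip_bit(lanes, lane, bit):
    """Simulate a transient fault by flipping a single bit in one lane."""
    faulty = list(lanes)
    faulty[lane] ^= 1 << bit
    return faulty


def vote(lanes):
    """Majority vote: return (value, fault_detected), or raise if no majority."""
    value, count = Counter(lanes).most_common(1)[0]
    if count <= len(lanes) // 2:
        raise RuntimeError("no majority: fault is uncorrectable")
    return value, count != len(lanes)


a = replicate(40)
b = replicate(2)
result = lane_add(a, b)                      # all three lanes agree
corrupted = flip_bit(result, lane=1, bit=4)  # one lane now disagrees
value, fault_detected = vote(corrupted)      # the two healthy lanes outvote it
```

In this sketch a single corrupted lane is outvoted by the other two, so the check both detects and corrects the fault; if all three lanes disagree, no majority exists and the fault can only be reported.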
