Epipe: A low-cost fault-tolerance technique considering WCET constraints

Transient faults will soon become a critical reliability concern for processors used in mainstream computing. As the mainstream commodity market accepts only low-cost solutions for transient-fault tolerance, traditional high-end solutions are not acceptable due to their prohibitive costs. This paper presents Epipe, a hybrid software/hardware solution that provides sufficient fault coverage with affordable overhead for mainstream commodity systems. Given a program, Epipe identifies its vulnerable instructions (VIs), i.e., the ones that may cause silent data corruptions (SDCs) by compile-time analysis, and selects a subset of VIs to protect considering worst-case execution time (WCET) constraints in the fault-free execution. During program execution on a modified superscalar processor which incurs minimal hardware overhead, Epipe relies on selective instruction replication to handle the VI-induced SDCs and an existing exception detector to tolerate the remaining faults that manifest as system exceptions. Our experimental results show that Epipe provides sufficient fault coverage under some tight WCET constraints and increasingly higher coverage under more relaxed WCET constraints. As the WCET allowance increases from 5% to 15% and then to 25%, the coverage increases from 70.8% to 80% and then to 86.6% averagely. Unlike existing hybrid solutions, Epipe is the first to respect WCET constraints, which are an important concern for real-time systems.

[1]  Björn Lisper,et al.  Data cache locking for higher program predictability , 2003, SIGMETRICS '03.

[2]  Shekhar Y. Borkar,et al.  Microarchitecture and Design Challenges for Gigascale Integration , 2004, MICRO.

[3]  Isabelle Puaut,et al.  A modular and retargetable framework for tree-based WCET analysis , 2001, Proceedings 13th Euromicro Conference on Real-Time Systems.

[4]  Qing Wan,et al.  WCET-aware data selection and allocation for scratchpad memory , 2012, LCTES 2012.

[5]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[6]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[7]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[8]  Ravishankar K. Iyer,et al.  Processor-Level Selective Replication , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[9]  Yang Shen,et al.  What Is System Hang and How to Handle It , 2012, 2012 IEEE 23rd International Symposium on Software Reliability Engineering.

[10]  Martin Schulz,et al.  A Foundation for the Accurate Prediction of the Soft Error Vulnerability of Scientific Applications , 2009 .

[11]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[12]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[13]  Jingling Xue,et al.  PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs , 2012, Journal of Computer Science and Technology.

[14]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[15]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[16]  Jakob Engblom,et al.  The worst-case execution-time problem—overview of methods and survey of tools , 2008, TECS.

[17]  Y. C. Yeh,et al.  Triple-triple redundant 777 primary flight computer , 1996, 1996 IEEE Aerospace Applications Conference. Proceedings.

[18]  Isabelle Puaut,et al.  Scheduling fault-tolerant distributed hard real-time tasks independently of the replication strategies , 1999, Proceedings Sixth International Conference on Real-Time Computing Systems and Applications. RTCSA'99 (Cat. No.PR00306).

[19]  Craig B. Zilles,et al.  A characterization of instruction-level error derating and its implications for error detection , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[20]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[21]  Scott A. Mahlke,et al.  Runtime asynchronous fault tolerance via speculation , 2012, CGO '12.

[22]  Guillem Bernat,et al.  WCET analysis of probabilistic hard real-time systems , 2002, 23rd IEEE Real-Time Systems Symposium, 2002. RTSS 2002..

[23]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[24]  Gerard J. M. Smit,et al.  A mathematical approach towards hardware design , 2010, Dynamically Reconfigurable Architectures.

[25]  Yun Zhou,et al.  The Reliability Wall for Exascale Supercomputing , 2012, IEEE Transactions on Computers.

[26]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[27]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[28]  Hui Wu,et al.  Optimal WCET-aware code selection for scratchpad memory , 2010, EMSOFT '10.

[29]  Karthik Pattabiraman,et al.  BLOCKWATCH: Leveraging similarity in parallel programs for error detection , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[30]  Sanjay J. Patel,et al.  Y-branches: when you come to a fork in the road, take it , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[31]  Yun Zhang,et al.  DAFT: decoupled acyclic fault tolerance , 2010, PACT '10.

[32]  Xianfeng Li,et al.  Chronos: A timing analyzer for embedded software , 2007, Sci. Comput. Program..

[33]  Tulika Mitra,et al.  Satisfying real-time constraints with custom instructions , 2005, 2005 Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'05).

[34]  Babak Falsafi,et al.  Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[35]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS 2010.

[36]  Björn Lisper,et al.  Data cache locking for tight timing calculations , 2007, TECS.

[37]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[38]  Jing Yu,et al.  ESoftCheck: Removal of Non-vital Checks for Fault Tolerance , 2009, 2009 International Symposium on Code Generation and Optimization.