A fault-tolerant programmable voter for software-based N-modular redundancy

This paper presents a fault-tolerant, programmable voter architecture for software-implemented N-tuple modular redundant (NMR) computer systems. Software NMR is a cost-efficient solution for high-performance, mission-critical computer systems because it can be built on top of commercial off-the-shelf (COTS) devices. Because the voting data is large in volume and random in nature, a software NMR system requires a programmable voter. Our experiments show that voting software executing on a processor has time-of-check-to-time-of-use (TOCTTOU) vulnerabilities and is unable to tolerate long-duration faults. To address these two problems, we present a special-purpose voter processor and its embedded software architecture. The processor provides a set of new instructions and hardware modules that the software uses to accelerate voting and to address the two identified reliability problems. We have implemented the presented system on an FPGA platform. Our evaluation results show that the presented system reduces the execution time of error detection codes (commonly used in voting software) by 14% and their code size by 56%. Our fault injection experiments validate that the presented system removes the TOCTTOU vulnerabilities and recovers from both transient and long-duration faults. This is achieved by adding 0.7% extra hardware to a baseline processor.
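To make the voting and TOCTTOU ideas concrete, the following is a minimal software sketch, not the voter processor or the embedded software described in the paper. It assumes each replica delivers a payload together with a CRC-32 checksum (chosen here only as an illustrative error detection code), and the names `crc_ok` and `vote` are hypothetical. The voter takes a private copy of each payload before checking it, so the bytes that are checked are the same bytes that are voted on, which is the property a TOCTTOU-free design has to guarantee.

```python
import zlib
from collections import Counter
from typing import Optional, Sequence, Tuple, Union

Buffer = Union[bytes, bytearray]


def crc_ok(payload: bytes, expected_crc: int) -> bool:
    """Return True if the payload matches its CRC-32 checksum."""
    return zlib.crc32(payload) == expected_crc


def vote(replicas: Sequence[Tuple[Buffer, int]]) -> Optional[bytes]:
    """Majority-vote over (payload, crc) pairs from N replicas.

    Returns the payload agreed on by a strict majority of all N replicas,
    or None if no such majority exists.
    """
    valid = []
    for payload, crc in replicas:
        # Snapshot first: check and use must operate on the same bytes,
        # even if the shared buffer is modified after the check.
        snapshot = bytes(payload)
        if crc_ok(snapshot, crc):
            valid.append(snapshot)

    if not valid:
        return None

    value, count = Counter(valid).most_common(1)[0]
    # A replica that failed its checksum still counts toward N.
    return value if count > len(replicas) // 2 else None


good = b"sensor=42"
bad = b"sensor=43"

# Two healthy replicas outvote one replica whose payload was corrupted
# after its checksum was computed.
result = vote([
    (bytearray(good), zlib.crc32(good)),
    (bytearray(good), zlib.crc32(good)),
    (bytearray(bad), zlib.crc32(good)),
])
```

In this sketch the snapshot is an ordinary memory copy, so it only narrows the window between check and use in software. The abstract attributes the actual removal of the vulnerability to dedicated instructions and hardware modules, which this example does not model.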
