Exploiting Idle Hardware to Provide Low Overhead Fault Tolerance for VLIW Processors

Technology scaling has increased the soft error rate in digital circuits, which degrades system reliability. Modern processors, including VLIW architectures, therefore need mechanisms to mitigate soft errors and guarantee reliable computing. This work proposes three low-overhead fault tolerance approaches based on instruction duplication with zero-latency detection, each using a rollback mechanism to correct soft errors in the pipelanes of a configurable VLIW processor. The first uses idle issue slots within a period of time to execute extra instructions, taking distinct application phases into account. The second works at a finer grain, adaptively exploiting idle functional units at run-time. However, some applications exhibit high instruction-level parallelism (ILP), which reduces the ability to provide fault tolerance: fewer functional units are idle, so fewer instructions can be duplicated. The third approach addresses this issue by dynamically reducing ILP according to a configurable threshold, increasing fault tolerance at the cost of performance. While the first two approaches achieve significant fault coverage with minimal area and power overhead for applications with low ILP, the third improves fault tolerance with low performance degradation. All approaches are evaluated in terms of area, performance, power dissipation, and error coverage.
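The trade-off between idle slots, duplication, and an ILP threshold can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the hardware mechanism of the proposed approaches: a bundle is modeled as a fixed-width list of operation names with `None` marking an idle issue slot, and the helper names (`duplicate_into_idle_slots`, `limit_ilp`, `coverage`, `mismatch`) are hypothetical. Splitting a bundle here ignores register hazards between its operations, which a real implementation would have to respect.

```python
from typing import List, Optional, Tuple

Bundle = List[Optional[str]]  # one VLIW bundle; None marks an idle issue slot


def duplicate_into_idle_slots(bundle: Bundle) -> Tuple[Bundle, List[Tuple[int, int]]]:
    """Copy operations into idle slots of the same bundle.

    Returns the filled bundle and (original_slot, duplicate_slot) pairs.
    Operations left without a free slot stay unprotected.
    """
    filled = list(bundle)
    idle = [i for i, op in enumerate(bundle) if op is None]
    pairs = []
    for i, op in enumerate(bundle):
        if op is None or not idle:
            continue
        j = idle.pop(0)
        filled[j] = op
        pairs.append((i, j))
    return filled, pairs


def limit_ilp(bundle: Bundle, threshold: int) -> List[Bundle]:
    """Split a bundle so that at most `threshold` operations issue together.

    Assumption: the operations can be issued in sequence without changing
    the result (register hazards are ignored in this sketch).
    """
    ops = [op for op in bundle if op is not None]
    if len(ops) <= threshold:
        return [list(bundle)]
    width = len(bundle)
    out = []
    for k in range(0, len(ops), threshold):
        chunk = ops[k:k + threshold]
        out.append(chunk + [None] * (width - len(chunk)))
    return out


def coverage(bundles: List[Bundle]) -> float:
    """Fraction of operations that obtain a duplicate in an idle slot."""
    total = sum(1 for b in bundles for op in b if op is not None)
    dup = sum(len(duplicate_into_idle_slots(b)[1]) for b in bundles)
    return dup / total if total else 1.0


def mismatch(results: List[int], pairs: List[Tuple[int, int]]) -> bool:
    """True if any original/duplicate pair disagrees (would trigger rollback)."""
    return any(results[i] != results[j] for i, j in pairs)


low_ilp: Bundle = ["add", "ld", None, None]
high_ilp: Bundle = ["mul", "sub", "st", "add"]

base = [low_ilp, high_ilp]
throttled = [low_ilp] + limit_ilp(high_ilp, threshold=2)

print(len(base), coverage(base))
print(len(throttled), coverage(throttled))
```

In this toy 4-issue model, the fully occupied bundle leaves no room for duplicates, so only part of the program is covered. Capping it at two operations per bundle makes every operation duplicable, at the cost of one extra bundle, which mirrors the performance-for-coverage trade described for the third approach.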
