FluidCheck: A Redundant Threading-Based Approach for Reliable Execution in Manycore Processors

Soft errors have become a serious cause of concern with reducing feature sizes. The ability to accommodate complex, Simultaneous Multithreading (SMT) cores on a single chip presents a unique opportunity to achieve reliable execution, safe from soft errors, with low performance penalties. In this context, we present FluidCheck, a checker architecture that allows highly flexible assignment and migration of checking duties across cores. In this article, we present a mechanism to dynamically use the resources of SMT cores for checking the results of other threads, and propose a variety of heuristics for migration of such checker threads across cores. Secondly, to make the process of checking more efficient, we propose a set of architectural enhancements that reduce power consumption, decrease the length of the critical path, and reduce the load on the Network-on-Chip (NoC). Based on our observations, we design a 16 core system for running SPEC2006 based bag-of-tasks applications. Our experiments demonstrate that fully reliable execution can be attained with a mere 27p slowdown, surpassing traditional redundant threading based techniques by roughly 42p.

[1]  Smruti R. Sarangi,et al.  A survey of checker architectures , 2013, CSUR.

[2]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[3]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[4]  Wei Liu,et al.  Using Register Lifetime Predictions to Protect Register Files Against Soft Errors , 2008 .

[5]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[6]  Babak Falsafi,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[7]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[8]  Aneesh Aggarwal,et al.  Speculative instruction validation for performance-reliability trade-off , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[9]  Todd M. Austin,et al.  Efficient checker processor design , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[10]  Jaume Abella,et al.  Selective replication: A lightweight technique for soft errors , 2009, TOCS.

[11]  Kewal K. Saluja,et al.  Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[12]  Susan J. Eggers,et al.  Thread-Sensitive Scheduling for SMT Processors , 2000 .

[13]  José Duato,et al.  L1-bandwidth aware thread allocation in multicore SMT processors , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[14]  Smruti R. Sarangi,et al.  ParTejas , 2017, ACM Trans. Model. Comput. Simul..

[15]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[16]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[17]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[18]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Sandhya Dwarkadas,et al.  Compatible phase co-scheduling on a CMP of multi-threaded processors , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[20]  Michael C. Huang,et al.  Supporting highly-decoupled thread-level redundancy for parallel programs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[21]  Huiyang Zhou,et al.  A case for fault tolerance and performance enhancement using chip multi-processors , 2006, IEEE Computer Architecture Letters.

[22]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[23]  Kewal K. Saluja,et al.  Energy-efficient fault tolerance in chip multiprocessors using Critical Value Forwarding , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[24]  Prathmesh Kallurkar,et al.  Tejas: A java based versatile micro-architectural simulator , 2015, 2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS).

[25]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[26]  Francisco J. Cazorla,et al.  Thread to Core Assignment in SMT On-Chip Multiprocessors , 2009, 2009 21st International Symposium on Computer Architecture and High Performance Computing.

[27]  Kevin Skadron,et al.  Real-world design and evaluation of compiler-managed GPU redundant multithreading , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[28]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[29]  A. Janiszewski,et al.  Architectural support for enhanced SMT job scheduling , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[30]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[31]  M. Marsan,et al.  7 Conclusion and Future Work , 2008 .

[32]  T. N. Vijaykumar,et al.  Opportunistic transient-fault detection , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[33]  Michael C. Huang,et al.  Exploiting coarse-grain verification parallelism for power-efficient fault tolerance , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[34]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[35]  Robert F. Lucas,et al.  An evaluation of lazy fault detection based on Adaptive Redundant Multithreading , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[36]  Kevin Skadron,et al.  Scaling with Design Constraints: Predicting the Future of Big Chips , 2011, IEEE Micro.

[37]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).