Paper information - FluidCheck: A Redundant Threading-Based Approach for Reliable Execution in Manycore Processors

Soft errors have become a serious concern as feature sizes shrink. The ability to accommodate complex Simultaneous Multithreading (SMT) cores on a single chip presents a unique opportunity to achieve reliable execution, safe from soft errors, with low performance penalties. In this context, we present FluidCheck, a checker architecture that allows highly flexible assignment and migration of checking duties across cores. First, we present a mechanism to dynamically use the resources of SMT cores for checking the results of other threads, and propose a variety of heuristics for migrating such checker threads across cores. Second, to make the process of checking more efficient, we propose a set of architectural enhancements that reduce power consumption, decrease the length of the critical path, and reduce the load on the Network-on-Chip (NoC). Based on our observations, we design a 16-core system for running SPEC2006-based bag-of-tasks applications. Our experiments demonstrate that fully reliable execution can be attained with a mere 27% slowdown, surpassing traditional redundant-threading-based techniques by roughly 42%.

Smruti R. Sarangi | Rajshekar Kalayappan
