Parallel Error Detection Using Heterogeneous Cores

Microprocessor error detection is increasingly important, as the growing number of transistors in modern systems heightens their vulnerability to faults. In addition, many modern workloads in domains such as the automotive and health industries are increasingly intolerant of errors, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Alternative solutions in the literature run the code multiple times on the same hardware, which significantly reduces performance and cannot capture all errors. We have designed a novel hardware-only solution for error detection that exploits parallelism that is available when checking code but may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads.
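The idea of checking portions of an execution independently can be illustrated with a small software model. The sketch below is not the hardware design itself: it assumes, purely for illustration, that the main core records a register checkpoint at each segment boundary, so that every segment can be re-executed from its own starting checkpoint without waiting for the others. The toy add-only instruction format and the names `execute`, `main_core_run`, `check_segments`, `segment_len` and `fault_at` are all hypothetical, and Python threads stand in for the small checker cores.

```python
from concurrent.futures import ThreadPoolExecutor

MASK = 0xFFFFFFFF  # model 32-bit registers

# Toy instruction (hypothetical format): (dest, a, b) means regs[dest] = regs[a] + regs[b].
def execute(regs, instrs):
    """Re-execute instrs from the given register state; return the final state."""
    regs = list(regs)
    for dest, a, b in instrs:
        regs[dest] = (regs[a] + regs[b]) & MASK
    return tuple(regs)

def main_core_run(initial_regs, program, segment_len, fault_at=None):
    """Model the main core: run the whole program and record a register
    checkpoint at every segment boundary. If fault_at is set, one bit of
    that instruction's result is flipped to model an error."""
    checkpoints = [tuple(initial_regs)]
    regs = list(initial_regs)
    for i, (dest, a, b) in enumerate(program):
        regs[dest] = (regs[a] + regs[b]) & MASK
        if i == fault_at:
            regs[dest] ^= 1  # injected single-bit error
        if (i + 1) % segment_len == 0 or i == len(program) - 1:
            checkpoints.append(tuple(regs))
    return checkpoints

def check_segments(program, checkpoints, segment_len, n_checkers=4):
    """Model the checker cores: each segment is re-executed from its start
    checkpoint and compared with its end checkpoint. Segments are independent,
    so they can be checked concurrently. Returns indices of failing segments."""
    segments = [program[i:i + segment_len]
                for i in range(0, len(program), segment_len)]

    def check(k):
        return execute(checkpoints[k], segments[k]) == checkpoints[k + 1]

    with ThreadPoolExecutor(max_workers=n_checkers) as pool:
        results = list(pool.map(check, range(len(segments))))
    return [k for k, ok in enumerate(results) if not ok]

if __name__ == "__main__":
    program = [(0, 1, 2), (1, 0, 3), (2, 1, 0), (3, 2, 1),
               (0, 3, 2), (1, 0, 0), (2, 1, 3), (3, 2, 0)]
    clean = main_core_run([1, 2, 3, 4], program, segment_len=3)
    faulty = main_core_run([1, 2, 3, 4], program, segment_len=3, fault_at=4)
    print(check_segments(program, clean, 3))   # no failing segments
    print(check_segments(program, faulty, 3))  # the segment containing the fault
```

In this model the original program is a single dependent chain of instructions, yet its checks can still run side by side, because each one starts from a recorded checkpoint rather than from the result of the previous check. A segment that starts from an already corrupted checkpoint passes its own check, so the fault is reported only for the segment in which it occurred.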
