Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

Transistor downsizing and operating-voltage scaling have made processor chips more sensitive to radiation, making soft errors an important challenge. New reliability techniques for handling soft errors in logic and memories, which allow designs to meet the desired failures-in-time (FIT) target, are key to continuing to harness the benefits of Moore's law. If the soft error rate caused by particle strikes fails to scale, it may soon limit the total number of cores that can run at the same time. This paper proposes a lightweight and scalable architecture to eliminate silent data corruption (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining errors in the cache hierarchy, which allows us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches, and memory arrays) while incurring a performance overhead as low as 0.60%.
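
The containment idea can be illustrated with a toy model. The sketch below is not the paper's design; it only shows, under stated assumptions, why holding unverified stores back tolerates a long detection latency. The class `ContainmentCache`, the parameter `detection_latency`, and the cycle-based interface are all hypothetical names introduced for illustration, and restoring architectural state after an error is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ContainmentCache:
    """Toy model: stores stay confined until the detector has had time to fire.

    detection_latency is an assumed worst-case detector latency in cycles.
    """
    detection_latency: int
    memory: dict = field(default_factory=dict)   # state considered error-free
    pending: list = field(default_factory=list)  # (issue_cycle, addr, value)

    def store(self, cycle: int, addr: int, value: int) -> None:
        # New stores are unverified, so they are kept out of memory for now.
        self.pending.append((cycle, addr, value))

    def tick(self, cycle: int) -> None:
        # Release stores old enough that a strike affecting them
        # would already have been reported by the detector.
        still_pending = []
        for issued, addr, value in self.pending:
            if cycle - issued >= self.detection_latency:
                self.memory[addr] = value
            else:
                still_pending.append((issued, addr, value))
        self.pending = still_pending

    def on_error_detected(self) -> int:
        # Detector fired: discard every unverified store and report how many.
        dropped = len(self.pending)
        self.pending.clear()
        return dropped


cache = ContainmentCache(detection_latency=100)
cache.store(cycle=0, addr=0x10, value=1)
cache.store(cycle=80, addr=0x20, value=2)
cache.tick(cycle=100)                  # only the first store is old enough
dropped = cache.on_error_detected()    # the second store is discarded
```

In this model a detected error never reaches `memory`, because anything younger than the assumed detection latency is still confined when the detector fires. A real mechanism would also need to roll the core back to a consistent point, which this sketch does not attempt.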
