Towards scalable reliability frameworks for error prone CMPs

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates will be especially high for post-CMOS and nanoelectronic systems, as well as for probabilistic [3] and stochastic [4] architectures. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, in which voting logic is integrated into routers to allow truly distributed voting. The in-network fault tolerance router exploits the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in voting to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance and bandwidth benefits from these optimizations.
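The voting idea behind NMR can be illustrated with a minimal sketch. This is not the paper's router design: the function names `majority_vote`, `merge_votes`, and `majority_from_counts` are hypothetical, and the sketch assumes each replica reports one hashable result value. It shows a strict-majority voter, plus one plausible way that identical vote messages could be collapsed into (value, count) pairs before reaching the leader, which is the kind of redundancy the in-network scheme exploits.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def merge_votes(votes: Sequence[Hashable]) -> List[Tuple[Hashable, int]]:
    """Hypothetical in-router step: collapse identical votes into (value, count)."""
    return list(Counter(votes).items())


def majority_from_counts(pairs: Sequence[Tuple[Hashable, int]],
                         n: int) -> Optional[Hashable]:
    """Leader-side decision over pre-merged votes from an n-replica group."""
    for value, count in pairs:
        if count > n // 2:
            return value
    return None


if __name__ == "__main__":
    votes = [7, 7, 9, 7, 3]          # five replicas, two of them faulty
    merged = merge_votes(votes)      # fewer messages reach the leader
    print(majority_vote(votes), merged, majority_from_counts(merged, len(votes)))
```

With merging, the leader handles one message per distinct value rather than one per replica. This illustrates how redundancy in vote messages might reduce traffic under the stated assumptions.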

[1]  Todd M. Austin, et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, 2003, MICRO.

[2]  J. Rupe.  Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, 2003.

[3]  Shlomo Weiss, et al.  DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals, 2008, IEEE Computer Architecture Letters.

[4]  Babak Falsafi, et al.  Reunion: Complexity-Effective Multicore Redundancy, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[5]  Shinobu Fujita, et al.  Modeling and analysis of circuit performance of ballistic CNFET, 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[6]  Ralph Grishman, et al.  The NYU ultracomputer—designing a MIMD, shared-memory parallel machine, 2018, ISCA '98.

[7]  G. Q. Zhang, et al.  Reliability challenges in the nanoelectronics era, 2006, Microelectron. Reliab.

[8]  Tipp Moseley, et al.  Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[9]  D. Jewett, et al.  Integrity S2: A Fault-Tolerant Unix Platform, 1991, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'.

[10]  A Case Study of C.mmp, Cm*, and C.vmp: Part I - Experiences with Fault Tolerance in Multiprocessor Systems, 1978.

[11]  Cristian Constantinescu, et al.  Trends and Challenges in VLSI Circuit Reliability, 2003, IEEE Micro.

[12]  Dimiter R. Avresky, et al.  Evaluation of Software-Implemented Fault-Tolerance (SIFT) Approach in Gracefully Degradable Multi-Computer Systems, 2006, IEEE Transactions on Reliability.

[13]  Engin Ipek, et al.  Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor, 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[14]  Juan L. Aragón, et al.  Evaluating Dynamic Core Coupling in a Scalable Tiled-CMP Architecture, 2008.

[15]  Lorenzo Alvisi, et al.  Modeling the effect of technology trends on the soft error rate of combinational logic, 2002, Proceedings International Conference on Dependable Systems and Networks.

[16]  Y. Aoyagi, et al.  Carbon nanotube devices for nanoelectronics, 2002.

[17]  D.P. Siewiorek, et al.  A case study of C.mmp, Cm*, and C.vmp: Part I—Experiences with fault tolerance in multiprocessor systems, 1978, Proceedings of the IEEE.

[18]  N. D. Durie, et al.  Digest of papers, 1976.

[19]  Sanjay J. Patel, et al.  Characterizing the effects of transient faults on a high-performance processor pipeline, 2004, International Conference on Dependable Systems and Networks, 2004.

[20]  A.L. Hopkins, et al.  FTMP—A highly reliable fault-tolerant multiprocessor for aircraft, 1978, Proceedings of the IEEE.

[21]  Martin L. Shooman, et al.  Reliability of computer systems and networks, 2001.

[22]  Prabhakar Kudva, et al.  Soft-error resilience of the IBM POWER6 processor, 2008, IBM J. Res. Dev.

[23]  A.M. Ionescu, et al.  New functionality and ultra low power: key opportunities for post-CMOS era, 2008, 2008 International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA).

[24]  Todd M. Austin.  DIVA: A Dynamic Approach to Microprocessor Verification, 2000, J. Instr. Level Parallelism.

[25]  Krishna V. Palem, et al.  Probabilistic system-on-a-chip architectures, 2007, TODE.