Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

Self-repair replaces or bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, uncore components (e.g., cache controllers, memory controllers, and I/O controllers) have received little attention with respect to self-repair, even though they occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are:

1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact.
2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance.
3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. They also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.
