论文信息 - Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are: 1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact. 2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance. 3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

[1] Edward J. McCluskey,et al. PADded cache: a new fault-tolerance technique for cache memories , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[2] Coniferous softwood. GENERAL TERMS , 2003 .

[3] Hyunki Kim,et al. Low-cost gate-oxide early-life failure detection in robust systems , 2010, 2010 Symposium on VLSI Circuits.

[4] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.

[5] Onur Mutlu,et al. Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6] Melvin A. Breuer,et al. Roving Emulation as a Fault Detection Mechanism , 1986, IEEE Transactions on Computers.

[7] Doug Burger,et al. Exploiting microarchitectural redundancy for defect tolerance , 2003, Proceedings 21st International Conference on Computer Design.

[8] Edward J. McCluskey,et al. Which concurrent error detection scheme to choose ? , 2000, Proceedings International Test Conference 2000 (IEEE Cat. No.00CH37159).

[9] Janusz Rajski,et al. A Rapid Yield Learning Flow Based on Production Integrated Layout-Aware Diagnosis , 2006, 2006 IEEE International Test Conference.

[10] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[11] Mario Schölzel,et al. Fine-Grained Software-Based Self-Repair of VLIW Processors , 2011, 2011 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems.

[12] Howard Jay Siegel,et al. A survey and comparison of fault-tolerant multistage interconnection networks , 1994 .

[13] U. Schlichtmann,et al. Goldilocks failures: Not too soft, not too hard , 2012, 2012 IEEE International Reliability Physics Symposium (IRPS).

[14] Subhasish Mitra,et al. CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns , 2008, 2008 Design, Automation and Test in Europe.

[15] Josep Torrellas,et al. Facelift: Hiding and slowing down aging in multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[16] Lieven Eeckhout,et al. Automated microprocessor stressmark generation , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[17] David Blaauw,et al. Compact In-Situ Sensors for Monitoring Negative-Bias-Temperature-Instability Effect and Oxide Degradation , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[18] Daniel J. Sorin,et al. Core Cannibalization Architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19] Gérard Memmi,et al. A reconfigurable design-for-debug infrastructure for SoCs , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[20] Christoforos E. Kozyrakis,et al. The ZCache: Decoupling Ways and Associativity , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[21] R. D. Blanton,et al. On-chip diagnosis for early-life and wear-out failures , 2012, 2012 IEEE International Test Conference.

[22] Santosh G. Abraham,et al. Effective instruction prefetching in chip multiprocessors for modern commercial applications , 2005, 11th International Symposium on High-Performance Computer Architecture.

[23] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[24] Shekhar Y. Borkar,et al. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[25] Mohammad Mirza-Aghatabar,et al. A design flow to maximize yield/area of physical devices via redundancy , 2012, 2012 IEEE International Test Conference.

[26] S. Pae,et al. Random charge effects for PMOS NBTI in ultra-small gate area devices , 2005, 2005 IEEE International Reliability Physics Symposium, 2005. Proceedings. 43rd Annual..

[27] Paul R. Turgeon,et al. Two approaches to array fault tolerance in the IBM Enterprise System/9000 Type 9121 processor , 1991, IBM J. Res. Dev..

[28] Shantanu Gupta,et al. Architectural core salvaging in a multi-core processor for hard-error tolerance , 2009, ISCA '09.

[29] Ming Zhang,et al. Circuit Failure Prediction and Its Application to Transistor Aging , 2007, 25th IEEE VLSI Test Symposium (VTS'07).

[30] Onur Mutlu,et al. Concurrent autonomous self-test for uncore components in system-on-chips , 2010, 2010 28th VLSI Test Symposium (VTS).

[31] Yervant Zorian,et al. Embedded-memory test and repair: infrastructure IP for SoC yield , 2003, IEEE Design & Test of Computers.

[32] J. Hicks. 45nm Transistor Reliability , 2008 .

[33] Jody Van Horn. Towards achieving relentless reliability gains in a server marketplace of teraflops, laptops, kilowatts, and "cost, cost, cost"...: making peace between a black art and the bottom line , 2005, ITC.

[34] Wei Chen,et al. The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series , 2007, IEEE Journal of Solid-State Circuits.

[35] Chris Allsup. Is built-in logic redundancy ready for prime time? , 2010, 2010 11th International Symposium on Quality Electronic Design (ISQED).

[36] Stephen P. Boyd,et al. Self-Tuning for Maximized Lifetime Energy-Efficiency in the Presence of Circuit Aging , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[37] Antonio Robles,et al. An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori , 2004, IEEE Computer Architecture Letters.

[38] Steffen Paul,et al. Memory built-in self-repair using redundant words , 2001, Proceedings International Test Conference 2001 (Cat. No.01CH37260).

[39] J. A. Cunningham. The use and evaluation of yield models in integrated circuit manufacturing , 1990 .

[40] Yervant Zorian,et al. Built in self repair for embedded high density SRAM , 1998, Proceedings International Test Conference 1998 (IEEE Cat. No.98CH36270).

[41] Josep Torrellas,et al. ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[42] S. Gupta,et al. Theory of redundancy for logic circuits to maximize yield/area , 2012, Thirteenth International Symposium on Quality Electronic Design (ISQED).

[43] Alfredo Benso,et al. An on-line BIST RAM architecture with self-repair capabilities , 2002, IEEE Trans. Reliab..

[44] T. N. Vijaykumar,et al. Rescue: a microarchitecture for testability and defect tolerance , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[45] Robert Aitken. A modular wrapper enabling high speed BIST and repair for small wide memories , 2004 .

[46] Dharma P. Agrawal,et al. A Survey and Comparision of Fault-Tolerant Multistage Interconnection Networks , 1987, Computer.

[47] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.