Overcoming Hard-Faults in High-Performance Microprocessors

As device density grows, each transistor gets smaller and more fragile, leading to an overall higher susceptibility to hard-faults. These hard-faults result in permanent silicon defects and impact the manufacturing yield, performance, and lifetime of semiconductor devices. In this thesis, we propose comprehensive, low-cost solutions to tackle reliability problems in high-performance microprocessors. These microprocessors mainly consist of on-chip caches and the core pipeline. We first present two flexible cache architectures, ZerehCache and Archipelago, to protect regular SRAM structures against high failure rates. ZerehCache virtually reorganizes the cache data array using a permutation network to provide higher degrees of freedom for spare allocation. To study the impact of fault patterns on the redundancy requirements of a cache, we propose a methodology that models the collision patterns in caches as a graph problem. Given this model, a graph coloring scheme is employed to minimize the amount of additional redundancy required to protect the cache. Archipelago targets failures in the near-threshold region. It resizes the cache to provide redundancy for repairing faulty cells. Furthermore, a near-optimal minimum clique covering configuration algorithm is introduced to minimize the cache capacity loss. With proper solutions in place for caches, a robust and heterogeneous core coupling execution scheme, Necromancer, is presented to protect the general core area against hard-faults. Although a faulty core cannot be trusted, we observe that for most defects, execution traces on a defective core coarsely resemble those of fault-free executions. Necromancer exploits a functionally dead core to improve system throughput by supplying hints regarding high-level program behavior. We partition the cores into multiple groups. Each group shares a lightweight core that can be substantially accelerated.
However, due to the presence of defects, the dead core cannot provide a perfect data or instruction stream. This necessitates low-cost recovery mechanisms and generic hints that are more resilient to local abnormalities.
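
The graph formulation used for the cache can be illustrated with a small sketch. It is not the thesis's algorithm; it is a minimal example under stated assumptions: each faulty cache line is a node, two lines "collide" (share an edge) when they have a fault in the same word position, and lines that receive the same color can share one spare line. The names `build_collision_graph`, `greedy_coloring`, and the fault-map layout are hypothetical, and the greedy highest-degree-first ordering is a common heuristic standing in for whatever coloring scheme the thesis actually employs.

```python
from itertools import combinations


def build_collision_graph(fault_maps):
    """Build an undirected collision graph.

    fault_maps maps a line id to the set of faulty word positions in that
    line (an assumed layout). Two lines are connected when they share at
    least one faulty position, meaning they cannot share a spare.
    """
    graph = {line: set() for line in fault_maps}
    for a, b in combinations(fault_maps, 2):
        if fault_maps[a] & fault_maps[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Color nodes greedily, highest degree first.

    This is a heuristic: it yields a valid coloring but not necessarily
    one with the minimum number of colors.
    """
    order = sorted(graph, key=lambda n: (-len(graph[n]), n))
    color = {}
    for node in order:
        used = {color[nbr] for nbr in graph[node] if nbr in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


# Hypothetical fault map: four faulty lines, faults given as word positions.
fault_maps = {
    0: {1},
    1: {1, 3},
    2: {3},
    3: {0},
}

graph = build_collision_graph(fault_maps)
coloring = greedy_coloring(graph)
# One spare line per color class under the assumptions above.
spares_needed = max(coloring.values(), default=-1) + 1
print(coloring, spares_needed)
```

In this toy input, lines 0 and 2 both collide with line 1 but not with each other, so they can share a spare, while line 3 collides with nothing and can join any group. Fewer colors therefore means fewer spare lines, which is the quantity the coloring step tries to minimize.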
