CoreRank: Redeeming “Sick Silicon” by Dynamically Quantifying Core-Level Healthy Condition

In field degradation of manycore processors poses a grand challenge to core management, largely because the degradation is hard to quantify. We propose a novel core-level degradation quantification scheme, CoreRank, to facilitate the management. We first develop a new degradation metric, called “healthy condition”, to capture the implication of performance degradation of a core with specific degraded components. Then, we propose a performance sampling scheme by using micro-operation streams, called snippet, to statistically quantify cores’ healthy condition. We find that similar snippets exhibit stable performance distribution, which makes them ideal micro-benchmarks to testify the core-level healthy conditions. We develop a hardware-implemented version of CoreRank based on bloom filter and hash table. Unlike the traditional “faulty” or “fault-free” judgement, CoreRank provides a key facility to make better use of those imperfect cores that suffered from various progressive aging mechanisms such as NBTI, HCI. Experimental results show that CoreRank successfully hides significant performance degradation of a defective manycore processor in which even more than half of the cores are salvaged from various defects.

[1]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[2]  Marc Girault,et al.  Hash-Functions Using Modulo-N Operations , 1987, EUROCRYPT.

[3]  Peter K. Pearson,et al.  Fast hashing of variable-length text strings , 1990, CACM.

[4]  Ronald L. Rivest,et al.  The MD5 Message-Digest Algorithm , 1992, RFC.

[5]  James D. Meindl,et al.  Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration , 2002, IEEE J. Solid State Circuits.

[6]  D. Kwong,et al.  Dynamic NBTI of PMOS transistors and its impact on device lifetime , 2003, 2003 IEEE International Reliability Physics Symposium Proceedings, 2003. 41st Annual..

[7]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[8]  Sarita V. Adve,et al.  The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[9]  Pradip Bose,et al.  Exploiting structural duplication for lifetime reliability enhancement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[11]  Karthick Rajamani,et al.  Thermal response to DVFS: analysis with an Intel Pentium M , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[12]  Coniferous softwood GENERAL TERMS , 2003 .

[13]  Shekhar Borkar Thousand Core ChipsA Technology Perspective , 2007, DAC 2007.

[14]  A. Choudhary,et al.  Mitigating the Effects of Process Variations : Architectural Approaches for Improving Batch Performance , 2007 .

[15]  Josep Torrellas,et al.  ReCycle:: pipeline adaptation to tolerate process variation , 2007, ISCA '07.

[16]  C.H. Kim,et al.  Silicon Odometer: An On-Chip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits , 2007, 2007 IEEE Symposium on VLSI Circuits.

[17]  Sule Ozev,et al.  Online diagnosis of hard faults in microprocessors , 2007, TACO.

[18]  Sarita V. Adve,et al.  Trace-based microarchitecture-level diagnosis of permanent hardware faults , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[19]  Josep Torrellas,et al.  Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors , 2008, 2008 International Symposium on Computer Architecture.

[20]  Josep Torrellas,et al.  Facelift: Hiding and slowing down aging in multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[21]  Pradip Bose,et al.  A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime , 2008, 2008 International Symposium on Computer Architecture.

[22]  Sarita V. Adve,et al.  mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23]  Shantanu Gupta,et al.  Architectural core salvaging in a multi-core processor for hard-error tolerance , 2009, ISCA '09.

[24]  Diana Marculescu,et al.  Variation-aware dynamic voltage/frequency scaling , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[25]  Xiaowei Li,et al.  Variation-Aware Scheduling for Chip Multiprocessors with Thread Level Redundancy , 2009, 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing.

[26]  Gu-Yeon Wei,et al.  Revival: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency , 2009, IEEE Micro.

[27]  Stijn Eyerman,et al.  Interval simulation: Raising the level of abstraction in architectural simulation , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[28]  David Blaauw,et al.  A power-efficient 32b ARM ISA processor using timing-error detection and correction for transient-error tolerance and adaptation to PVT variation , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[29]  Xiaowei Li,et al.  Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors , 2010, ISCA '10.

[30]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[31]  Runsheng Wang,et al.  Towards the systematic study of aging induced dynamic variability in nano-MOSFETs: Adding the missing cycle-to-cycle variation effects into device-to-device variation , 2011, 2011 International Electron Devices Meeting.

[32]  Xiaowei Li,et al.  SVFD: A Versatile Online Fault Detection Scheme via Checking of Stability Violation , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[33]  Paolo A. Aseron,et al.  A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance , 2011, IEEE Journal of Solid-State Circuits.

[34]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[35]  Sherief Reda,et al.  Pack & Cap: Adaptive DVFS and thread packing under power caps , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[36]  Xiaowei Li,et al.  ReviveNet: A Self-Adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation , 2011, IEEE Transactions on Computers.

[37]  Krishna K. Rangan,et al.  Achieving uniform performance and maximizing throughput in the presence of heterogeneity , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[38]  David Blaauw,et al.  A Power-Efficient 32 bit ARM Processor Using Timing-Error Detection and Correction for Transient-Error Tolerance and Adaptation to PVT Variation , 2011, IEEE Journal of Solid-State Circuits.

[39]  Lieven Eeckhout,et al.  Scheduling heterogeneous multi-cores through performance impact estimation (PIE) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[40]  Doug Burger,et al.  Exploiting microarchitectural redundancy for defect tolerance , 2003, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[41]  Minyi Guo,et al.  AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[42]  Christine A. Shoemaker,et al.  Flicker: a dynamically adaptive architecture for power limited multicore systems , 2013, ISCA.

[43]  Tianshi Chen,et al.  Statistical Performance Comparisons of Computers , 2012, IEEE Transactions on Computers.