Long term sustainability of differentially reliable systems in the dark silicon era

As transistor miniaturization continues, providing robustness and computational correctness incurs rising power, performance, and area overheads. At the same time, the diversity of software error tolerance is increasing as modern society embraces ubiquitous computing. This diversity can be exploited by differentially reliable (DR) multicore systems. The rising level of dark silicon (the portion of a chip that must remain inactive due to power budget constraints) makes such DR systems even more attractive than homogeneous designs, because the flexibility to dynamically select appropriate cores for a given software workload improves power efficiency. However, ensuring the long-term sustainability of these DR systems is a profound challenge. Asymmetric utilization of cores, differential aging degradation, and manufacturing process variation alter the relative reliability of DR system components, degrading and even eliminating the energy-efficiency advantage. In this paper, we propose a feedback-control-based thread-to-core mapping framework to ensure long-term sustainability and extend the energy efficiency of a DR system. We evaluate our approach on two DR design techniques over a ten-year lifespan and demonstrate sustained energy-efficiency benefits of 14.4-16.3% and 26.1-31.0%, respectively, surpassing the recently proposed race-to-idle approach.
