Performance/Reliability-Aware Resource Management for Many-Cores in Dark Silicon Era

Aggressive technology scaling has enabled the fabrication of many-core architectures while triggering challenges such as limited power budget and increased reliability issues, like aging phenomena. Dynamic power management and runtime mapping strategies can be utilized in such systems to achieve optimal performance while satisfying power constraints. However, lifetime reliability is generally neglected. We propose a novel lifetime reliability/performance-aware resource co-management approach for many-core architectures in the dark silicon era. The approach is based on a two-layered architecture, composed of a long-term runtime reliability controller and a short-term runtime mapping and resource management unit. The former evaluates the cores’ aging status w.r.t. a target reference specified by the designer, and performs recovery actions on highly stressed cores by means of power capping. The aging status is utilized in runtime application mapping to maximize system performance while fulfilling reliability requirements and honoring the power budget. Experimental evaluation demonstrates the effectiveness of the proposed strategy, which outperforms most recent state-of-the-art contributions.

[1]  Axel Jantsch,et al.  Reliability-Aware Runtime Power Management for Many-Core Systems in the Dark Silicon Era , 2017, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[2]  Salvatore Monteleone,et al.  Noxim: An open, extensible and cycle-accurate network on chip simulator , 2015, 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[3]  Pasi Liljeberg,et al.  Smart hill climbing for agile dynamic mapping in many-core systems , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[4]  Marco Gribaudo,et al.  A lightweight and open-source framework for the lifetime estimation of multicore systems , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[5]  Donald E. Thomas,et al.  Lifetime improvement through runtime wear-based task mapping , 2012, CODES+ISSS '12.

[6]  Sheldon X.-D. Tan,et al.  Lifetime optimization for real-time embedded systems considering electromigration effects , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[7]  Yun Zhang,et al.  Revisiting the Sequential Programming Model for Multi-Core , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[8]  Qiang Xu,et al.  Characterizing the lifetime reliability of manycore processors with core-level redundancy , 2010, 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[9]  Gianluca Palermo,et al.  Voltage island management in near threshold manycore architectures to mitigate dark silicon , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[10]  Cristinel Ababei,et al.  Unified reliability estimation and management of NoC based chip multiprocessors , 2014, Microprocess. Microsystems.

[11]  David Blaauw,et al.  Multi-Mechanism Reliability Modeling and Management in Dynamic Systems , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[12]  Li Shang,et al.  System-level reliability modeling for MPSoCs , 2010, 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[13]  Sheldon X.-D. Tan,et al.  Learning-based dynamic reliability management for dark silicon processor considering EM effects , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[14]  Kai Ma,et al.  PGCapping: Exploiting power gating for power capping and core lifetime balancing in CMPs , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  Roman L. Lysecky,et al.  Workload assignment considering NBTI degradation in multicore systems , 2014, ACM J. Emerg. Technol. Comput. Syst..

[16]  Huan Liu,et al.  A Measurement Study of Server Utilization in Public Clouds , 2011, 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing.

[17]  Bharadwaj Veeravalli,et al.  Run-time mapping for reliable many-cores based on energy/performance trade-offs , 2013, 2013 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS).

[18]  Shuguang Feng,et al.  Self-calibrating Online Wearout Detection , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[19]  Timothy G. Mattson,et al.  Light-weight communications on Intel's single-chip cloud computer processor , 2011, OPSR.

[20]  Kevin Skadron,et al.  Temperature-aware microarchitecture: Modeling and implementation , 2004, TACO.

[21]  Axel Jantsch,et al.  MapPro: Proactive Runtime Mapping for Dynamic Workloads by Quantifying Ripple Effect of Applications on Networks-on-Chip , 2015, NOCS.

[22]  Luca Benini,et al.  Dynamic variability management in mobile multicore processors under lifetime constraints , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[23]  Hannu Tenhunen,et al.  A lifetime-aware runtime mapping approach for many-core systems in the dark silicon era , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[24]  Bharadwaj Veeravalli,et al.  Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[25]  Tajana Simunic,et al.  Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors , 2009, SIGMETRICS '09.

[26]  Fernando Gehm Moraes,et al.  Heuristics for Dynamic Task Mapping in NoC-based Heterogeneous MPSoCs , 2007, 18th IEEE/IFIP International Workshop on Rapid System Prototyping (RSP '07).

[27]  Axel Jantsch,et al.  Dark silicon aware power management for manycore systems under dynamic workloads , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[28]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[29]  Pasi Liljeberg,et al.  CoNA: Dynamic application mapping for congestion reduction in many-core systems , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[30]  Josep Torrellas,et al.  The BubbleWrap many-core: Popping cores for sequential acceleration , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[31]  Muhammad Shafique,et al.  Hayat: Harnessing Dark Silicon and variability for aging deceleration and balancing , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[32]  Pradip Bose,et al.  The case for lifetime reliability-aware microprocessors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[33]  Xiaobo Sharon Hu,et al.  Enhancing multicore reliability through wear compensation in online assignment and scheduling , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).