Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip

Dynamic Reliability Management (DRM) is a common approach to mitigating aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained control over resource management to increase or balance chip reliability while respecting other system constraints, e.g., performance and power budget. Such approaches act on various knobs, such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS), and Per-Core Power Gating (PCPG), and have been shown to work well for several aging mechanisms, such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we argue that they do not suffice for thermal cycling. We therefore propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and temperature variations. Experimental evaluations demonstrate that the proposed approach achieves a 39% longer lifetime than past approaches.
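The abstract does not state which wear-out model the approach uses. As a rough illustration of why both temperature levels and temperature variations matter for thermal cycling, the sketch below uses a Coffin-Manson-style estimate with an Arrhenius term, a form commonly found in the reliability literature. The function name and all parameter values (`a_tc`, `t_th`, `b`, `e_a`) are illustrative assumptions, not values taken from this work.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def cycles_to_failure(delta_t, t_max, a_tc=1.0, t_th=0.0, b=2.35, e_a=0.42):
    """Coffin-Manson-style estimate of thermal cycles to failure.

    delta_t : cycle amplitude in kelvin
    t_max   : peak temperature of the cycle in kelvin
    a_tc, t_th, b, e_a : illustrative placeholder constants (assumed,
                         not taken from the paper)
    """
    effective = delta_t - t_th
    if effective <= 0:
        # Amplitudes at or below the threshold are treated as harmless here.
        return math.inf
    return a_tc * effective ** (-b) * math.exp(e_a / (BOLTZMANN_EV * t_max))


if __name__ == "__main__":
    baseline = cycles_to_failure(delta_t=20.0, t_max=350.0)
    smaller_swing = cycles_to_failure(delta_t=10.0, t_max=350.0)
    cooler_peak = cycles_to_failure(delta_t=20.0, t_max=340.0)
    print("gain from halving the amplitude:", smaller_swing / baseline)
    print("gain from a 10 K cooler peak:", cooler_peak / baseline)
```

Under these assumed constants, shrinking the cycle amplitude and lowering the peak temperature each raise the estimated number of cycles to failure, which is the intuition behind controlling both quantities rather than average temperature alone.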
