Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins

Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups, and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose manycore GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, a primary one being power consumption. Voltage reduction is one of the most efficient methods to reduce the power consumption of a chip. With the rapid adoption of hardware accelerators (i.e., GPUs and FPGAs) in large data centers and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction level of each type of chip can guide an efficient reduction of total power. We present a survey of recent studies on voltage margin reduction at the system level for modern CPUs, GPUs, and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs, and 39% in FPGAs.
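To give a feel for why these voltage margins matter, the following minimal sketch applies the standard first-order CMOS model, in which dynamic power scales with the square of the supply voltage at a fixed clock frequency. This model is an assumption made here for illustration, not the survey's measurement method; it ignores static (leakage) power, and the function and variable names are hypothetical.

```python
def dynamic_power_saving(voltage_reduction: float) -> float:
    """Fractional dynamic-power saving for a fractional supply-voltage reduction.

    Assumes the first-order model P_dyn ~ C * V^2 * f with capacitance and
    frequency held constant (an illustrative assumption, not a measured result).
    """
    if not 0.0 <= voltage_reduction < 1.0:
        raise ValueError("voltage_reduction must be in [0, 1)")
    return 1.0 - (1.0 - voltage_reduction) ** 2


# Average voltage reductions quoted in the abstract above.
AVERAGE_VOLTAGE_REDUCTION = {"CPU": 0.12, "GPU": 0.20, "FPGA": 0.39}

for device, reduction in AVERAGE_VOLTAGE_REDUCTION.items():
    saving = dynamic_power_saving(reduction)
    print(f"{device}: {reduction:.0%} lower voltage -> about {saving:.1%} lower dynamic power")
```

Under this idealized model, the quoted average reductions would correspond to roughly 23%, 36%, and 63% lower dynamic power for CPUs, GPUs, and FPGAs respectively. Real savings depend on the leakage contribution, the workload, and chip-to-chip variation.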
