Predictable GPUs Frequency Scaling for Energy and Performance

Dynamic voltage and frequency scaling (DVFS) is an important technique for balancing performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The ability to set these frequencies manually is a great opportunity for application tuning, which can target the best application-dependent setting. However, this task is not straightforward, both because the set of possible configurations is large and because the problem is multi-objective: energy consumption must be minimized while performance is maximized. This paper proposes a method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. Our modeling approach, based on machine learning, first predicts speedup and normalized energy relative to the default frequency configuration. Then, it combines the two models into a multi-objective one that predicts a Pareto set of frequency configurations. The approach uses static code features, is built on a set of carefully designed micro-benchmarks, and can predict the best frequency settings of a new kernel without executing it. Test results show that our modeling approach is very accurate at predicting the extrema points and the Pareto set for ten out of twelve test benchmarks, and that it discovers frequency configurations that dominate the default configuration in either energy or performance.
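To make the multi-objective step concrete, the following is a minimal sketch of how a Pareto set could be extracted once speedup and normalized energy have been predicted for each frequency configuration. It is an illustration under stated assumptions, not the paper's implementation: the `Config` type, the `dominates` and `pareto_set` helpers, and every frequency and prediction value below are hypothetical. Speedup is treated as higher-is-better and normalized energy as lower-is-better, with the default configuration at 1.0 for both.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One (core, memory) frequency setting with its predicted objectives."""
    core_mhz: int
    mem_mhz: int
    speedup: float      # predicted speedup over the default (higher is better)
    norm_energy: float  # predicted energy relative to the default (lower is better)


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a.speedup >= b.speedup and a.norm_energy <= b.norm_energy
    strictly_better = a.speedup > b.speedup or a.norm_energy < b.norm_energy
    return no_worse and strictly_better


def pareto_set(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical predictions; all numbers are made up for illustration.
default = Config(1000, 3500, 1.00, 1.00)
candidates = [
    default,
    Config(1200, 3500, 1.15, 1.10),  # fastest, but costs more energy
    Config(800, 3500, 0.90, 0.85),   # most energy-efficient, but slower
    Config(800, 810, 0.60, 0.95),    # slower and less efficient than (800, 3500)
    Config(1100, 3500, 1.08, 0.98),  # faster and cheaper than the default
]

front = pareto_set(candidates)
for c in front:
    print(c)
```

With these made-up numbers the default configuration is not on the front, because the (1100, 3500) setting is both faster and cheaper. The pairwise check is quadratic in the number of configurations, which is acceptable for the few hundred settings a GPU typically exposes.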
