Systematic derivation of time and power models for linear algebra kernels on multicore architectures

Abstract The power wall calls for a holistic effort from the high-performance and scientific computing communities to develop power-aware tools and applications, which in turn drive the design of energy-efficient hardware. Toward this goal, we introduce a systematic methodology that derives reliable time and power models for algebraic kernels using a bottom-up approach. This strategy helps to understand how the different kernels contribute to the total energy consumption of an application, and to distinguish between the costs of fine-grain components such as arithmetic, memory access, and the overheads introduced by, e.g., multithreading or reductions. To study and validate the methodology, we initially focus on two key memory-bound BLAS-1 vector kernels: the dot product and the axpy operation. We then show how these kernels can be composed to accurately predict the energy consumption of more heterogeneous algorithms, such as the Conjugate Gradient method, while accounting for the elaborate memory hierarchy and the high degree of concurrency of today's processors. In particular, the evaluation of the models on the IBM® Blue Gene/Q supercomputer and on the IBM® Power 755 server reveals that average power consumption is captured with high accuracy. The models and the methodology are nevertheless general enough to be portable to any general-purpose multicore architecture.
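To make the idea of composing kernel models concrete, the following is a minimal sketch and not the model derived in the paper. It assumes each kernel has a time that is linear in the vector length and a constant average power. The class `KernelModel`, the function `composed_energy`, and every numeric coefficient are hypothetical placeholders chosen for illustration. The call counts reflect the textbook Conjugate Gradient iteration, which performs two dot products and three axpy-type vector updates; the matrix-vector product is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelModel:
    """Hypothetical per-kernel model: time linear in n, constant average power."""
    seconds_per_element: float  # assumed fitted slope (placeholder value)
    overhead_seconds: float     # assumed fixed cost, e.g. threading or reduction
    average_watts: float        # assumed average power while the kernel runs

    def time(self, n: int) -> float:
        return self.overhead_seconds + self.seconds_per_element * n

    def energy(self, n: int) -> float:
        # Energy = average power x execution time.
        return self.average_watts * self.time(n)


def dot(x, y):
    """Reference dot product: sum of x[i] * y[i]."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Reference axpy: returns alpha * x + y as a new list."""
    return [alpha * a + b for a, b in zip(x, y)]


def composed_energy(models, calls, n):
    """Sum the modelled energy of each kernel times its call count."""
    return sum(count * models[name].energy(n) for name, count in calls.items())


# Placeholder coefficients, NOT measured on any machine.
models = {
    "dot": KernelModel(2.0e-9, 1.0e-6, 50.0),
    "axpy": KernelModel(3.0e-9, 5.0e-7, 55.0),
}

# Vector kernels in one textbook CG iteration (matrix-vector product omitted).
cg_iteration_calls = {"dot": 2, "axpy": 3}

if __name__ == "__main__":
    n = 1000
    joules = composed_energy(models, cg_iteration_calls, n)
    print(f"Modelled energy of the vector kernels for n={n}: {joules:.6e} J")
```

In this sketch the per-kernel coefficients would come from fitting measurements on the target machine; a more faithful model would also separate arithmetic, memory-access, and overhead terms as the abstract describes.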
