Towards Automatic Compiler-assisted Performance and Energy Modeling for Message Passing Parallel Programs

Optimizing programs for modern distributed-memory parallel architectures is a notoriously difficult task, which has created the need for modeling tools that estimate the execution time and energy consumption of message passing programs. Many prediction tools require substantial manual effort, need excessive training for each target architecture, or limit the class of input programs they can handle. We present a compiler-based approach that automatically generates parametrized analytical models. While requiring only minimal training overhead on target architectures, it still provides reasonably accurate models for the execution time and energy consumption of message passing programs. Our method uses compiler analyses to identify the structure of code regions in input programs and extracts important parameters such as loop iteration counts and message buffer sizes. We can then predict the performance of these code regions for new problem sizes and target machines. We show that compiler knowledge can be used effectively to minimize training overhead, and we evaluate our approach on multiple target applications with varying problem and machine sizes. Initial results obtained with our prototype implementation show a mean coefficient of determination (R²) of 0.93 over 7 input programs.
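To make the idea of a parametrized analytical model concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): a runtime model for one code region is expressed as a linear combination of terms derived from compiler-extracted parameters (problem size n, process count p), and its coefficients are calibrated from a handful of training runs via least squares. The specific model terms and function names here are illustrative assumptions.

```python
import math

def features(n, p):
    # Hypothetical model terms for one code region: constant overhead,
    # total work, per-rank work, and a log-structured communication term.
    return [1.0, n, n / p, math.log2(p)]

def fit(samples):
    # samples: list of ((n, p), measured_time) from a few training runs.
    # Ordinary least squares via the normal equations (stdlib only).
    X = [features(n, p) for (n, p), _ in samples]
    y = [t for _, t in samples]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = xty[r] - sum(xtx[r][c] * coef[c] for c in range(r + 1, k))
        coef[r] = s / xtx[r][r]
    return coef

def predict(coef, n, p):
    # Evaluate the calibrated model for a new problem/machine size.
    return sum(c * f for c, f in zip(coef, features(n, p)))
```

Once calibrated on a small training set, `predict` extrapolates the region's runtime to unseen (n, p) combinations; an analogous model with energy measurements as the target would yield energy predictions.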
