Paper Information - An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems

An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems

Performance modeling of parallel applications on distributed memory systems is a challenging task due to the effects of CPU speed, memory access time, and communication cost. In this paper, we propose a simple and intuitive graphical model, which extends the widely used Roofline performance model to include the communication cost in addition to the memory access time and the peak CPU performance. This new performance model inherits the simplicity of the original Roofline model and enables performance evaluation on a third dimension of communication performance. Such a model will greatly facilitate and expedite the analysis, development and optimization of parallel programs on high-end computer systems. We empirically validate the extended new Roofline model using floating-point-computation-bound, memory-bound, and communication-bound applications. Three distinct high-end computing platforms have been tested: 1) high performance computing (HPC) systems, 2) high throughput computing systems, and 3) cloud computing systems. Our experimental results with four different parallel applications show that the new model can approximately evaluate the performance of different programs on various distributed-memory systems. Furthermore, the extended new model is able to provide insight into how the problem size can affect the upper bound performance of parallel applications, which is a special property revealed by the new dimension of communication cost analysis.
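To make the idea of a third, communication dimension concrete, here is a minimal sketch in Python. The classical Roofline bound is the minimum of peak compute and memory bandwidth times operational intensity. The extension shown here simply adds one more ceiling, network bandwidth times a "communication intensity" (flops per byte sent over the network). That extra term, the function names, and all parameter names are assumptions made for illustration; the abstract does not give the paper's exact formulation, so this should not be read as the authors' model.

```python
def roofline_bound(peak_gflops, mem_bw_gbs, op_intensity):
    """Classical Roofline: attainable GFLOP/s is capped by peak compute
    and by memory bandwidth (GB/s) times operational intensity (flops/byte)."""
    return min(peak_gflops, mem_bw_gbs * op_intensity)


def extended_roofline_bound(peak_gflops, mem_bw_gbs, net_bw_gbs,
                            op_intensity, comm_intensity):
    """Assumed extension: add a third ceiling from network bandwidth (GB/s)
    times communication intensity (flops per byte communicated)."""
    return min(peak_gflops,
               mem_bw_gbs * op_intensity,
               net_bw_gbs * comm_intensity)


def limiting_factor(peak_gflops, mem_bw_gbs, net_bw_gbs,
                    op_intensity, comm_intensity):
    """Return the name of the lowest ceiling (first one wins on ties)."""
    ceilings = {
        "compute": peak_gflops,
        "memory": mem_bw_gbs * op_intensity,
        "communication": net_bw_gbs * comm_intensity,
    }
    return min(ceilings, key=ceilings.get)


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory, 5 GB/s network.
    for comm_intensity in (2.0, 10.0, 40.0):
        bound = extended_roofline_bound(100.0, 50.0, 5.0, 4.0, comm_intensity)
        factor = limiting_factor(100.0, 50.0, 5.0, 4.0, comm_intensity)
        print(comm_intensity, bound, factor)
```

Under this sketch, a kernel whose communication intensity grows with problem size would move from the communication ceiling toward the compute ceiling as the problem gets larger, which is one way to read the abstract's remark that problem size affects the upper-bound performance.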

Fengguang Song | David Cardwell
