Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end, we strictly differentiate between application and machine models: an application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a CPU. We introduce a generic method for determining machine models and present results for relevant server-processor architectures by Intel, AMD, IBM, and Marvell/Cavium. Considering this wide range of architectures, the set of features required for adequate performance modeling is surprisingly small. To validate our approach, we compare performance predictions to empirical data for an OpenMP-parallel preconditioned CG algorithm, which includes compute- and memory-bound kernels. Both single- and multicore analyses show that the model exhibits average and maximum relative errors of 5% and 10%, respectively. Deviations from the model and insights gained are discussed in detail.
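The separation between application and machine models can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the two machine properties, and the simple bottleneck (roofline-style) prediction are hypothetical stand-ins and not the model or the machine parameters described in this work. The sketch only shows how a loop description and a CPU description can be kept apart, and how average and maximum relative errors against measurements might be computed.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MachineModel:
    """Hypothetical CPU abstraction with two illustrative properties."""
    name: str
    flops_per_cycle_per_core: float  # assumed in-core throughput limit
    mem_bytes_per_cycle: float       # assumed chip-wide memory bandwidth


@dataclass(frozen=True)
class ApplicationModel:
    """Hypothetical loop description: work and data traffic per iteration."""
    flops_per_iter: float
    bytes_per_iter: float


def predict_cycles_per_iter(app: ApplicationModel,
                            machine: MachineModel,
                            cores: int = 1) -> float:
    """Simplified bottleneck prediction (an assumption, not the paper's model).

    In-core work scales with the number of cores, the memory interface is
    shared, and the slower of the two determines the runtime.
    """
    t_core = app.flops_per_iter / (machine.flops_per_cycle_per_core * cores)
    t_mem = app.bytes_per_iter / machine.mem_bytes_per_cycle
    return max(t_core, t_mem)


def relative_errors(predicted: Sequence[float],
                    measured: Sequence[float]) -> Tuple[float, float]:
    """Return (average, maximum) relative error of predictions vs. measurements."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors), max(errors)


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    machine = MachineModel("hypothetical-cpu", 16.0, 8.0)
    memory_bound = ApplicationModel(flops_per_iter=2.0, bytes_per_iter=24.0)
    compute_bound = ApplicationModel(flops_per_iter=32.0, bytes_per_iter=8.0)
    print(predict_cycles_per_iter(memory_bound, machine))
    print(predict_cycles_per_iter(compute_bound, machine, cores=2))
    print(relative_errors([90.0, 100.0], [100.0, 100.0]))
```

Because the loop description never refers to a specific CPU, the same `ApplicationModel` instance can be evaluated against several `MachineModel` instances, which is the point of keeping the two apart.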
