Automatic loop kernel analysis and performance modeling with Kerncraft

Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model, one can infer the relevant hardware bottlenecks and identify promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers, since it requires in-depth knowledge of the hardware and of how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can, in the ideal case, predict the single-core performance and the scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft along with its capabilities and limitations, and we show how it can be used to gain insights quickly through accelerated analytic modeling.
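As a rough illustration of the kind of prediction the Roofline model yields, the following minimal sketch computes an upper performance bound as the minimum of the peak compute rate and the product of computational intensity and memory bandwidth. It is not Kerncraft's implementation or interface; the function name, the machine parameters, and the per-iteration flop and byte counts are hypothetical values chosen only for this example.

```python
def roofline_performance(peak_flops, mem_bandwidth, flops_per_iter, bytes_per_iter):
    """Return a Roofline upper bound on performance in flop/s.

    peak_flops      -- peak compute rate in flop/s (assumed machine value)
    mem_bandwidth   -- attainable memory bandwidth in byte/s (assumed machine value)
    flops_per_iter  -- floating-point operations per loop iteration
    bytes_per_iter  -- data traffic to/from memory per loop iteration in bytes
    """
    # Computational intensity: work done per byte of memory traffic.
    intensity = flops_per_iter / bytes_per_iter
    # The loop cannot run faster than either the compute or the data-transfer limit.
    return min(peak_flops, intensity * mem_bandwidth)


if __name__ == "__main__":
    # Hypothetical machine: 16 Gflop/s peak, 40 GB/s memory bandwidth.
    # Hypothetical kernel: a[i] = b[i] + s * c[i] in double precision,
    # 2 flops per iteration and 32 bytes of traffic (two loads, one store,
    # plus an assumed write-allocate transfer for a[i]).
    bound = roofline_performance(16e9, 40e9, 2, 32)
    print(f"Roofline bound: {bound / 1e9:.2f} Gflop/s")
```

With these assumed numbers the bound is set by the memory bandwidth rather than by the peak compute rate, which is the typical situation for streaming kernels.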
