Applying the roofline model

The recently introduced roofline model plots the performance of executed code against its operational intensity (operation count divided by memory traffic). It also includes two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth, which is relevant for code with low operational intensity. The model thus makes the notions of memory-bound and compute-bound more precise and, despite its simplicity, can provide an insightful visualization of bottlenecks. As such, it can be valuable both for guiding manual code optimization and in education. Unfortunately, to date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this paper we show how to produce roofline plots with measured data on recent generations of Intel platforms. We show how to accurately measure the necessary quantities for a given program using performance counters, covering threaded and vectorized code as well as warm-cache and cold-cache scenarios. We explain the measurement approach and its validation, and discuss its limitations. Finally, we present a set of roofline plots with measured data for common numerical functions on a variety of platforms, which to this extent is shown for the first time, and discuss their possible uses.
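The two ceilings described above can be illustrated with a minimal Python sketch. It is not the measurement method of the paper: the function names, the platform figures (100 Gflop/s peak, 20 GB/s bandwidth), and the daxpy-like kernel are hypothetical values chosen only for illustration.

```python
def attainable_performance(intensity, peak_flops, peak_bandwidth):
    """Roofline bound in flops/s for an operational intensity in flops/byte.

    The bound is the lower of the two ceilings: the processor's peak
    performance and memory bandwidth times operational intensity.
    """
    return min(peak_flops, peak_bandwidth * intensity)


def ridge_point(peak_flops, peak_bandwidth):
    """Operational intensity (flops/byte) at which the two ceilings meet."""
    return peak_flops / peak_bandwidth


def classify(intensity, peak_flops, peak_bandwidth):
    """Label a kernel memory-bound or compute-bound under the model."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical platform figures (assumptions, not measured values).
PEAK_FLOPS = 100e9      # flops/s
PEAK_BANDWIDTH = 20e9   # bytes/s

# Hypothetical daxpy-like kernel, y = a*x + y, on n doubles:
# 2n flops; 24n bytes of traffic (read x, read y, write y at 8 bytes each).
n = 1_000_000
intensity = (2 * n) / (24 * n)

bound = attainable_performance(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
label = classify(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"intensity = {intensity:.4f} flops/byte, "
      f"bound = {bound / 1e9:.2f} Gflop/s, {label}")
```

In a measured roofline plot, the operation count, memory traffic, and runtime would come from performance counters rather than from hand counts like the ones used here.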
