Achieving accurate and context‐sensitive timing for code optimization

Most high‐performance computing (HPC) applications need their key computational kernels to run near peak efficiency. Reaching this level of efficiency has always required extensive tuning of each kernel on the particular platform of interest. The success or failure of an optimization is usually measured by invoking a timer. Building reliable and context‐sensitive timers is nonetheless one of the most neglected areas in HPC; as a result, much HPC software looks good when reported in papers but delivers only a fraction of the reported performance when used by actual HPC applications. In this paper, we motivate the importance of timer design and then discuss the techniques and methodologies we have developed to accurately time HPC kernel routines for our well‐known empirical tuning framework, ATLAS. Copyright © 2008 John Wiley & Sons, Ltd.
