Speeding up Nek5000 with autotuning and specialization

Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation in order to select the best-performing solution for a particular architecture. Specialization optimizes code by customizing it to a particular class of input data sets. In this paper, we demonstrate how compiler-based autotuning that incorporates specialization for the expected data set sizes of key computations can be used to speed up Nek5000, a spectral-element code. Nek5000 makes heavy use of what are effectively Basic Linear Algebra Subprograms (BLAS) calls, but for very small matrices. Through autotuning and specialization, we achieve significant performance gains over hand-tuned libraries (e.g., Goto, ATLAS, and ACML BLAS). Additional performance gains are obtained from higher-level compiler optimizations that aggregate multiple BLAS calls. We demonstrate more than 2.2X performance gains on an Opteron over the original manually tuned implementation, and speedups of up to 1.26X on the entire application running on 256 nodes of the Cray XT5 Jaguar system at Oak Ridge.
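The two ideas in the abstract can be illustrated with a toy sketch. It is written in Python for brevity and is not the compiler-based approach the paper describes: it only shows the general pattern of generating a variant specialized to known matrix sizes and then timing the candidates empirically to pick one. All names here (`matmul_generic`, `make_specialized`, `autotune`) are hypothetical, and the 4x4 sizes are an assumption chosen for illustration.

```python
import timeit


def matmul_generic(a, b, m, k, n):
    """Reference C = A * B on nested lists; works for any sizes."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def make_specialized(m, k, n):
    """Generate a fully unrolled multiply for one fixed (m, k, n), with k >= 1.

    Specialization here means the sizes are baked into the generated
    source, so no loop bounds are evaluated at run time.
    """
    rows = []
    for i in range(m):
        entries = []
        for j in range(n):
            terms = " + ".join(f"a[{i}][{p}] * b[{p}][{j}]" for p in range(k))
            entries.append(terms)
        rows.append("[" + ", ".join(entries) + "]")
    src = "def kernel(a, b):\n    return [" + ", ".join(rows) + "]\n"
    namespace = {}
    exec(src, namespace)  # compile the generated source into a function
    return namespace["kernel"]


def autotune(candidates, a, b, repeats=200):
    """Time every candidate on representative inputs; return the fastest."""
    timings = {}
    for name, fn in candidates.items():
        runs = timeit.repeat(lambda fn=fn: fn(a, b), number=repeats, repeat=3)
        timings[name] = min(runs)  # best of three to reduce timing noise
    best = min(timings, key=timings.get)
    return best, timings


# Assumed problem size for illustration only.
M, K, N = 4, 4, 4
a = [[float(i + j) for j in range(K)] for i in range(M)]
b = [[float(i * j + 1) for j in range(N)] for i in range(K)]

candidates = {
    "generic": lambda x, y: matmul_generic(x, y, M, K, N),
    "specialized_4x4x4": make_specialized(M, K, N),
}
best, timings = autotune(candidates, a, b)
print("selected variant:", best)
```

Which variant wins depends on the machine and interpreter, which is the reason for selecting empirically instead of assuming a winner in advance.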
