Compiler-assisted performance tuning

The enormous and growing complexity of today's high-end systems compounds the already significant challenge of maximizing the performance of equally complex scientific applications. In this paper, we discuss the role of compiler technology in supporting application developers in a systematic approach to performance tuning of key application components. Based on scenarios taken from manual optimization of scientific codes, we describe how compiler support can enable the programmer to achieve the same or better performance far more productively. We also present examples generated automatically by compiler optimization that achieve performance comparable to hand-tuned code.
