CHiLL : A Framework for Composing High-Level Loop Transformations

This paper describes a general and robust loop transformation framework that enables compilers to generate efficient code on complex loop nests. Despite two decades of prior research on loop optimization, performance of compiler-generated code often falls short of manually optimized versions, even for some well-studied BLAS kernels. There are two primary reasons for this. First, today’s compilers employ fixed transformation strategies, making it difficult to adapt to different optimization requirements for different application codes. Second, code transformations are treated in isolation, not taking into account the interactions between different transformations. This paper addresses such limitations in a unified framework that supports a broad collection of transformations, (permutation, tiling, unroll-and-jam, data copying, iteration space splitting, fusion, distribution and others), which go beyond existing polyhedral transformation models. This framework is a key element of a compiler we are developing which performs empirical optimization to evaluate a collection of alternative optimized variants of a code segment. A script interface to code generation and empirical search permits transformation parameters to be adjusted independently and tested; alternative scripts are used to represent different code variants. By applying this framework to example codes, we show performance results on automaticallygenerated code for the Pentium M and MIPS R10000 that are comparable to the best hand-tuned codes, and significantly better (up to a 14x speedup) than the native compilers.

[1]  W. Pugh,et al.  A framework for unifying reordering transformations , 1993 .

[2]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[3]  Ken Kennedy,et al.  Improving the ratio of memory operations to floating-point operations in loops , 1994, TOPL.

[4]  Jingling Xue Automating Non-Unimodular Loop Transformations for Massive Parallelism , 1994, Parallel Comput..

[5]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[6]  Loop Transformation Using Nonunimodular Matrices , 1995, IEEE Trans. Parallel Distributed Syst..

[7]  William Pugh,et al.  The Omega Library interface guide , 1995 .

[8]  William Pugh,et al.  Optimization within a unified transformation framework , 1996 .

[9]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[10]  James Demmel,et al.  Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[11]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[12]  Monica S. Lam,et al.  Maximizing Parallelism and Minimizing Synchronization with Affine Partitions , 1998, Parallel Comput..

[13]  Matteo Frigo,et al.  A fast Fourier transform compiler , 1999, SIGP.

[14]  Keshav Pingali,et al.  Tiling Imperfectly-nested Loop Nests , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[15]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[16]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[17]  Marta Jiménez,et al.  Register tiling in nonrectangular iteration spaces , 2002, TOPL.

[18]  Gang Ren,et al.  A comparison of empirical and model-driven optimization , 2003, PLDI '03.

[19]  Ken Kennedy,et al.  Transforming Complex Loop Nests for Locality , 2004, The Journal of Supercomputing.

[20]  A data locality optimizing algorithm , 2004, SIGP.

[21]  Cédric Bastoul,et al.  Code generation in the polyhedral model is easier than you think , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[22]  J. Ramanujam,et al.  Beyond unimodular transformations , 1995, The Journal of Supercomputing.

[23]  Gang Ren,et al.  Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.

[24]  Chun Chen,et al.  Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy , 2005, International Symposium on Code Generation and Optimization.

[25]  Keshav Pingali,et al.  Think globally, search locally , 2005, ICS '05.

[26]  David A. Padua,et al.  A Language for the Compact Representation of Multiple Program Versions , 2005, LCPC.

[27]  David Parello,et al.  Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies , 2006, International Journal of Parallel Programming.

[28]  William Jalby,et al.  Loop Optimization using Hierarchical Compilation and Kernel Decomposition , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[29]  Albert Cohen,et al.  Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[30]  Keshav Pingali,et al.  A singular loop transformation framework based on non-singular matrices , 1992, International Journal of Parallel Programming.