Exposing Tunable Parameters in Multi-threaded Numerical Code

Achieving high performance on today's architectures requires careful orchestration of many optimization parameters. In particular, the presence of shared-caches on multicore architecturesmakes it necessary to consider, in concert, issues related to both parallelism and data locality. This paper presents a systematic and extensive exploration of thecombined search space of transformation parameters that affect both parallelism and data locality inmulti-threaded numerical applications.We characterize the nature of the complex interaction between blocking, problem decomposition and selection of loops for parallelism. We identify key parameters for tuning and provide an automatic mechanism for exposing these parameters to a search tool. A series of experiments on two scientific benchmarks illustrates the non-orthogonality of the transformation search space and reiterates the need for integrated transformation heuristics for achieving high-performance on current multicore architectures.

[1]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  Ken Kennedy,et al.  Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[3]  William Thies,et al.  A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[4]  Chun Chen,et al.  Loop Transformation Recipes for Code Generation and Auto-Tuning , 2009, LCPC.

[5]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[6]  Matteo Frigo A Fast Fourier Transform Compiler , 1999, PLDI.

[7]  Stephen F. Jenks,et al.  The Synchronized Pipelined Parallelism Model , 2004 .

[8]  Nathan R. Tallent,et al.  HPCTOOLKIT: tools for performance analysis of optimized parallel programs , 2010, Concurr. Comput. Pract. Exp..

[9]  Richard W. Vuduc,et al.  POET: Parameterized Optimizations for Empirical Tuning , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[10]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[11]  Kostas Papadopoulos,et al.  HelperCore_DB: Exploiting Multicore Technology for Databases , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[12]  R. C. Whaley,et al.  Automated transformation for performance-critical kernels , 2007, LCSD '07.

[13]  Michael E. Wolf,et al.  Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.

[14]  Ken Kennedy,et al.  Profitable loop fusion and tiling using model-driven empirical search , 2006, ICS '06.

[15]  Allen,et al.  Optimizing Compilers for Modern Architectures , 2004 .

[16]  Franz Franchetti,et al.  Discrete fourier transform on multicore , 2009, IEEE Signal Processing Magazine.

[17]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[18]  Uday Bondhugula,et al.  Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[19]  Xipeng Shen,et al.  Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.

[20]  Jack J. Dongarra,et al.  Feedback-directed thread scheduling with memory considerations , 2007, HPDC '07.

[21]  L. Almagor,et al.  Finding effective compilation sequences , 2004, LCTES '04.

[22]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[23]  David G. Wonnacott,et al.  Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[24]  Hsien-Hsin S. Lee,et al.  Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25]  Jack Dongarra,et al.  Parallel tiled QR factorization for multicore architectures , 2008 .