Cache-aware partitioning of multi-dimensional iteration spaces

The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling that captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach that captures the variation of both the number of cache misses and the computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
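To make the idea of cache-aware partitioning concrete, the following is a minimal sketch, not the technique of the paper itself, whose details the abstract does not give. It assumes a one-dimensional parallel loop for which a profiling run has produced a per-iteration compute cost and a per-iteration cache-miss count. The function name `partition_by_cost`, the `miss_penalty` parameter, and the linear cost model (compute plus penalty times misses) are all hypothetical choices made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_by_cost(compute, misses, miss_penalty, num_cores):
    """Split iterations 0..n-1 into num_cores contiguous chunks whose
    estimated costs are roughly equal.

    Assumed cost model (illustrative only): the cost of iteration i is
    compute[i] + miss_penalty * misses[i].
    Returns a list of half-open (start, end) index ranges, one per core.
    """
    cost = [c + miss_penalty * m for c, m in zip(compute, misses)]
    prefix = list(accumulate(cost))  # prefix[i] = cost of iterations 0..i
    total = prefix[-1]

    bounds = [0]
    for k in range(1, num_cores):
        target = total * k / num_cores
        # First iteration at which the running cost reaches the target;
        # that iteration is included in the current chunk.
        cut = bisect_left(prefix, target) + 1
        cut = min(max(cut, bounds[-1]), len(cost))
        bounds.append(cut)
    bounds.append(len(cost))
    return [(bounds[i], bounds[i + 1]) for i in range(num_cores)]


# Hypothetical profile: uniform compute, but the second half of the
# iteration space incurs four cache misses per iteration.
compute = [1] * 8
misses = [0, 0, 0, 0, 4, 4, 4, 4]
chunks = partition_by_cost(compute, misses, miss_penalty=1, num_cores=2)
print(chunks)  # [(0, 6), (6, 8)]
```

In this made-up profile, an even split by iteration count would give the two cores estimated costs of 4 and 20, whereas the cost-weighted split gives 14 and 10. A real implementation for nested or non-rectangular loops would need a multi-dimensional cost model and would have to account for cache state carried between iterations, which this sketch ignores.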
