Efficient Scheduling of Nested Parallel Loops on Multi-Core Systems

Parallel loops, such as the parallel DO loop in Fortran, account for a large percentage of total execution time. We therefore focus on the problem of efficiently scheduling perfectly and non-perfectly nested parallel loops on emerging multi-core systems. Two key aspects of this problem are determining the profitability of parallel execution and capturing cache behavior efficiently, since the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we first propose an iteration space scheduling technique that captures the effect of variation in the number of cache misses across the iteration space. We then propose a general approach that captures the variation of both the number of cache misses and the amount of computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel Xeon based multiprocessor using several kernels from industry-standard benchmarks.
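To make the idea of cost-aware iteration space scheduling concrete, the following is a minimal sketch and not the technique of the paper itself. It assumes that a profiling run has already produced, for each iteration of a flattened loop, an estimated computation cost and a cache miss count. The function names (`iteration_costs`, `partition_by_cost`), the `miss_penalty` parameter, and the simple greedy splitting rule are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def iteration_costs(compute: List[float], misses: List[float],
                    miss_penalty: float) -> List[float]:
    """Combine per-iteration computation cost and cache misses into one weight.

    miss_penalty is an assumed cost (in the same units as `compute`)
    charged for each profiled cache miss.
    """
    return [c + miss_penalty * m for c, m in zip(compute, misses)]


def partition_by_cost(costs: List[float],
                      num_threads: int) -> List[Tuple[int, int]]:
    """Split iterations 0..n-1 into contiguous half-open ranges, one per thread.

    Each boundary is placed where the running cost is closest to the
    ideal cumulative share for that thread (a greedy midpoint rule).
    """
    total = sum(costs)
    ranges: List[Tuple[int, int]] = []
    start = 0
    running = 0.0
    for t in range(1, num_threads):
        target = total * t / num_threads
        end = start
        # Take an iteration while at least half of its cost fits under the target.
        while end < len(costs) and running + costs[end] / 2 <= target:
            running += costs[end]
            end += 1
        ranges.append((start, end))
        start = end
    ranges.append((start, len(costs)))  # last thread takes the remainder
    return ranges


# Hypothetical profile: the second half of the iteration space misses in cache.
compute = [1.0] * 8
misses = [0, 0, 0, 0, 1, 1, 1, 1]
costs = iteration_costs(compute, misses, miss_penalty=10.0)
print(partition_by_cost(costs, 2))  # [(0, 6), (6, 8)]
```

In this made-up profile, an equal-count split into (0, 4) and (4, 8) would give the two threads weights of 4 and 44, whereas the cost-aware split gives 26 and 22. A real scheduler would also have to handle multi-dimensional and non-rectangular iteration spaces, which this sketch does not attempt.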
