On-chip cache hierarchy-aware tile scheduling for multicore machines

Iteration space tiling and scheduling is an important technique for optimizing loops, which account for a large fraction of execution time in the computation kernels of both scientific codes and embedded applications. While tiling has been studied extensively for both uniprocessor and multiprocessor platforms, prior research has paid less attention to tile scheduling, especially for multicore machines with deep on-chip cache hierarchies. In this paper, we propose a cache hierarchy-aware tile scheduling algorithm for multicore machines that aims to maximize both horizontal and vertical data reuse in on-chip caches and to balance the workload across cores. This scheduling algorithm is one of the key components of a source-to-source translation tool that we developed for automatic loop parallelization and multithreaded code generation from sequential codes. To the best of our knowledge, this is the first effort to develop a fully automated tile scheduling strategy customized for the on-chip cache topologies of multicore machines. Experimental results collected by executing twelve application programs on three commercial Intel machines (Nehalem, Dunnington, and Harpertown) show that our cache-aware tile scheduling yields a 27.9% reduction in cache misses and, on average, a 13.5% improvement in execution time over an alternative method tested.
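The general idea of mapping tiles to cores according to the cache topology can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, only a simple heuristic written under stated assumptions: the iteration space is two-dimensional, neighboring tiles are assumed to touch overlapping data, horizontal reuse is interpreted as reuse among cores that share a cache, and vertical reuse as reuse between successive tiles run on the same core. The names `make_tiles`, `schedule_tiles`, and `cache_groups`, as well as the two-group topology in the example, are hypothetical.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (row origin, column origin) of one tile


def make_tiles(n_rows: int, n_cols: int, tile: int) -> List[Tile]:
    """Enumerate tile origins of an n_rows x n_cols iteration space, row-major."""
    return [(i, j) for i in range(0, n_rows, tile) for j in range(0, n_cols, tile)]


def schedule_tiles(tiles: List[Tile],
                   cache_groups: List[List[int]]) -> Dict[int, List[Tile]]:
    """Assign tiles to cores (illustrative heuristic, not the paper's algorithm).

    cache_groups lists, per shared cache, the ids of the cores attached to it
    (a hypothetical description of the topology).
    """
    total_cores = sum(len(group) for group in cache_groups)
    schedule: Dict[int, List[Tile]] = {}
    cores_seen = 0
    for group in cache_groups:
        # Each cache group receives a contiguous chunk of tiles whose size is
        # proportional to its core count, which keeps the load roughly balanced.
        start = len(tiles) * cores_seen // total_cores
        cores_seen += len(group)
        end = len(tiles) * cores_seen // total_cores
        chunk = tiles[start:end]
        # Within a group, tiles are dealt round-robin: at step k the cores of
        # the group hold neighboring tiles (sharing through the common cache),
        # while each core's own successive tiles stay inside the same chunk.
        for slot, core in enumerate(group):
            schedule[core] = chunk[slot::len(group)]
    return schedule


if __name__ == "__main__":
    tiles = make_tiles(8, 8, 2)                      # 16 tiles of size 2 x 2
    plan = schedule_tiles(tiles, [[0, 1], [2, 3]])   # two caches, two cores each
    for core, assigned in sorted(plan.items()):
        print(core, assigned)
```

In this sketch the topology is passed in explicitly, so the same tile list can be rescheduled for a different cache arrangement by changing only `cache_groups`. A real scheduler would also have to respect inter-tile dependences and cache capacities, which the sketch ignores.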
