Locality-aware mapping and scheduling for multicores

This paper presents a cache hierarchy-aware code mapping and scheduling strategy for multicore architectures. The mapping strategy determines a loop iteration-to-core mapping by taking into account the application's data access patterns and the on-chip cache hierarchy. It employs a novel concept called “core vectors” to obtain a mapping matrix that exploits data reuses at different layers of the cache hierarchy based on their reuse distances, with the goal of maximizing data locality at each level while minimizing data dependences across the cores. The scheduling strategy, on the other hand, determines a schedule for the iterations assigned to each core, with the goal of reducing data reuse distances across the cores for dependence-free loop nests. Our experimental evaluation shows that the proposed mapping scheme significantly reduces miss rates at all cache levels as well as application execution time, and that these reductions become much larger when the mapping is supported by scheduling.
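
To make the notion of reuse distance concrete, the following minimal sketch measures it for two iteration orders of a small loop nest. It is an illustration only, not the paper's mapping or scheduling algorithm. It assumes the common definition of reuse distance as the number of distinct elements accessed between two consecutive accesses to the same element; the loop nest, the array name `b`, and the helpers `reuse_distances` and `column_trace` are hypothetical.

```python
from typing import Hashable, List, Optional, Sequence


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """For each access, count the distinct elements touched since the
    previous access to the same element (None on a first access)."""
    last_seen = {}  # element -> index of its most recent access
    distances = []
    for i, elem in enumerate(trace):
        if elem in last_seen:
            # Distinct elements accessed strictly between the two accesses.
            distances.append(len(set(trace[last_seen[elem] + 1 : i])))
        else:
            distances.append(None)
        last_seen[elem] = i
    return distances


def column_trace(order):
    """Hypothetical 2-deep loop nest that reads b[j] at iteration (i, j)."""
    return [("b", j) for (i, j) in order]


N = 4
# Two schedules of the same iteration space (assumed dependence-free).
row_major = [(i, j) for i in range(N) for j in range(N)]
col_major = [(i, j) for j in range(N) for i in range(N)]

far = reuse_distances(column_trace(row_major))   # reuses are N - 1 apart
near = reuse_distances(column_trace(col_major))  # reuses are back to back

print(max(d for d in far if d is not None))
print(max(d for d in near if d is not None))
```

Both orders execute the same iterations, but the second one brings every reuse of `b[j]` next to the previous access. A shorter reuse distance makes it more likely that the element is still resident in a small, near cache level, which is the effect a locality-aware schedule aims for.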
