Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Multicore architectures are evolving with the promise of extreme performance for applications that require high performance and large memory bandwidth. Irregular reduction is an important computation pattern in many complex scientific applications, and it typically has exactly these requirements. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing a memory hierarchy in software requires considerable programming effort and tends to be error-prone, and the difficulty is greater still for applications with irregular data access patterns. To relieve programmers of the burden of memory management, we develop abstractions, targeted specifically at irregular reductions, for structuring parallel tasks, mapping those tasks to processing units, and scheduling data transfers between levels of the memory hierarchy. Our framework employs iteration reordering based on regions of data, along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques on irregular reduction kernels running on the Cell processor embedded in a Sony PlayStation 3. Experimental results show speedups of 8 to 14 on the six available SPEs.
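To make the idea of region-based iteration reordering concrete, the following is a minimal sketch and not the framework described in the article. It assumes a simplified reduction of the form `y[idx[i]] += val[i]` with a single indirect index per iteration. The function names, the `region_size` parameter, and the list slice standing in for a local-store buffer are all hypothetical. Iterations are grouped by the region of the reduction array they update, so each region can be brought into a small local buffer, updated, and written back as one independent task.

```python
from collections import defaultdict


def irregular_reduction(n, idx, val):
    """Reference version: y[idx[i]] += val[i] for every iteration i."""
    y = [0.0] * n
    for target, v in zip(idx, val):
        y[target] += v
    return y


def group_by_region(idx, region_size):
    """Bucket iteration numbers by the region of y that they update."""
    regions = defaultdict(list)
    for it, target in enumerate(idx):
        regions[target // region_size].append(it)
    return regions


def region_based_reduction(n, idx, val, region_size):
    """Process one region at a time; each region is an independent task."""
    y = [0.0] * n
    for region, iters in group_by_region(idx, region_size).items():
        lo = region * region_size
        hi = min(lo + region_size, n)
        local = y[lo:hi]              # assumption: stands in for a transfer into local memory
        for it in iters:
            local[idx[it] - lo] += val[it]
        y[lo:hi] = local              # assumption: stands in for the write-back transfer
    return y


idx = [5, 0, 3, 5, 1, 7, 2, 0]
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result = region_based_reduction(8, idx, val, region_size=3)
```

Because no two regions touch the same elements of `y`, the per-region tasks in this sketch could be handed to whichever worker becomes idle, which is one way to picture the dynamic scheduling mentioned above. Kernels whose iterations update elements in more than one region need additional handling that this sketch does not cover.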
