Programmable and Scalable Reductions on Clusters

Reductions matter and they are here to stay. The wide adoption of parallel processing hardware across a broad range of computer applications has encouraged recent research into their efficient parallelization, and the trend towards high-productivity languages in mainstream computing further increases the demand for efficient programming support. In this paper we present a new approach to parallel reductions for distributed memory systems that provides both scalability and programmability. Using OmpSs, a task-based parallel programming model, the developer can express scalable reductions through a single pragma annotation. This annotation is applicable to tasks as well as to work-sharing constructs (with implicit tasking) and instructs the compiler to generate the required runtime calls. The supporting runtime handles data and task distribution, parallel execution, and data reduction. Scalability is achieved through a software cache that maximizes local and temporal data reuse and allows computation to overlap with communication. Results confirm scalability on up to 32 twelve-core cluster nodes.

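As an illustration of the programming interface described above, the following is a minimal sketch of a task-based array sum in OmpSs-style C. The reduction clause on the task construct and the blocking scheme are assumptions for illustration, modeled on the OmpSs/OpenMP task-reduction proposals; they are not necessarily the paper's exact syntax. Without an OmpSs-aware compiler the pragmas are ignored and the code runs sequentially.

#include <stdio.h>

/* Sketch: each task accumulates a partial sum of one block, and the
   runtime combines the partials. The reduction(+: sum) clause on a
   task construct follows OmpSs-style task-reduction proposals; the
   exact clause spelling may vary between implementations (assumption). */
long sum_array(const long *a, long n, long block)
{
    long sum = 0;
    for (long i = 0; i < n; i += block) {
        long end = (i + block < n) ? i + block : n;
        #pragma omp task reduction(+: sum) firstprivate(i, end)
        for (long j = i; j < end; j++)
            sum += a[j];
    }
    #pragma omp taskwait   /* wait for all partial sums to be combined */
    return sum;
}

int main(void)
{
    long a[1000];
    for (long i = 0; i < 1000; i++) a[i] = i;
    printf("%ld\n", sum_array(a, 1000, 128)); /* expected: 499500 */
    return 0;
}

Per the abstract, the same annotation also applies to work-sharing constructs with implicit tasking, i.e. a reduction clause on a parallel loop construct instead of on individual tasks.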