Balanced, Locality-Based Parallel Irregular Reductions

Considerable effort has recently been devoted to parallelizing irregular reductions efficiently. The parallelization techniques proposed in recent years can be classified into two groups: LPO (Loop Partitioning Oriented) methods and DPO (Data Partitioning Oriented) methods. We have analyzed both classes with respect to a set of performance aspects: data locality, memory overhead, parallelism, and workload balancing. Load balancing has not been sufficiently analyzed in the literature on parallel reduction methods, especially for those in the DPO class. In this paper we propose two techniques to introduce load balancing into a DPO method. The first technique is generic, as it can deal with any kind of load imbalance present in the problem domain. The second technique handles a special case of load imbalance that appears when a large number of write operations fall on small regions of the reduction arrays. Efficient implementations of the proposed load-balancing solutions are presented for an example DPO method. Experiments on static and dynamic kernel codes were conducted, including comparisons with other parallel reduction methods.
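To make the load-balancing problem concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes an irregular reduction of the form `a[idx[i]] += vals[i]` and a simple data-partitioned scheme in which the reduction array is split into equal-size contiguous blocks and each iteration is assigned to the thread that owns the block it writes. All function names (`irregular_reduction`, `dpo_partition`, `dpo_reduction`, `imbalance`) and the equal-block ownership rule are illustrative assumptions. Threads are simulated by a sequential loop over per-thread work lists.

```python
def irregular_reduction(n, idx, vals):
    """Sequential reference: a[idx[i]] += vals[i] for every iteration i."""
    a = [0.0] * n
    for target, v in zip(idx, vals):
        a[target] += v
    return a


def dpo_partition(n, idx, num_threads):
    """Assign each iteration to the thread owning the block of `a` it writes.

    Assumption: `a` is split into equal-size contiguous blocks, one per thread.
    """
    block = (n + num_threads - 1) // num_threads  # ceiling division
    work = [[] for _ in range(num_threads)]
    for it, target in enumerate(idx):
        work[target // block].append(it)
    return work


def dpo_reduction(n, idx, vals, num_threads):
    """Run the reduction thread by thread; no two threads write the same block."""
    a = [0.0] * n
    for iters in dpo_partition(n, idx, num_threads):
        # Each work list could run on its own thread without synchronization,
        # because its writes stay inside the block that thread owns.
        for it in iters:
            a[idx[it]] += vals[it]
    return a


def imbalance(work):
    """Ratio of the heaviest thread's iteration count to the mean count."""
    loads = [len(w) for w in work]
    return max(loads) / (sum(loads) / len(loads))


# Many writes concentrated on a small region of the reduction array.
idx = [0, 1, 0, 1, 0, 1, 6, 7]
vals = [1.0] * len(idx)
work = dpo_partition(8, idx, 4)
print([len(w) for w in work])  # per-thread iteration counts
print(imbalance(work))         # 1.0 would mean perfectly balanced
```

In this toy input, six of the eight iterations write to elements 0 and 1, so the thread owning the first block receives most of the work. This is the kind of imbalance the abstract attributes to many writes on small regions of the reduction arrays.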
