Communication-Aware Supernode Shape

In this paper we revisit the supernode-shape selection problem, which has been widely discussed in the literature. The choice of supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since minimizing the overall parallel execution time through an appropriate supernode transformation is very difficult, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during execution. We argue that the communication volume of the transformed algorithm is an important criterion and that its minimization should be given high priority. For this reason we define the per-process communication volume as a metric and propose a method that minimizes it by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, so it can be incorporated into applications in a straightforward manner. Our experimental results show that selecting the tile shape with the proposed method significantly reduces the total parallel execution time because the communication volume is minimized, even though a few more parallel execution steps are required.
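To make the idea of a communication-aware process grid concrete, the following is a minimal sketch and not the method of the paper. It assumes a rectangular iteration space, equal-sized rectangular blocks, and a simplified cost model in which each process sends one face of its block along every dimension that is actually split. The function names `factorizations`, `per_process_volume`, and `best_grid` are hypothetical. The resulting tuple is the kind of value one would pass as the `dims` argument of MPI_Cart_create.

```python
from math import prod


def factorizations(p, ndims):
    """Yield every ordered tuple of ndims positive integers whose product is p."""
    if ndims == 1:
        yield (p,)
        return
    for f in range(1, p + 1):
        if p % f == 0:
            for rest in factorizations(p // f, ndims - 1):
                yield (f,) + rest


def per_process_volume(extents, grid):
    """Assumed cost model: for each split dimension, a process sends one
    face of its local block, i.e. the product of the other block extents."""
    block = [n / g for n, g in zip(extents, grid)]
    total = prod(block)
    return sum(total / b for b, g in zip(block, grid) if g > 1)


def best_grid(extents, nprocs):
    """Return the process grid with the smallest per-process volume
    under the assumed model (exhaustive search over factorizations)."""
    return min(
        factorizations(nprocs, len(extents)),
        key=lambda grid: per_process_volume(extents, grid),
    )


if __name__ == "__main__":
    print(best_grid((1024, 1024), 16))  # (4, 4)
    print(best_grid((4096, 256), 16))   # (16, 1)
```

Under this assumed model a square domain favors a square grid, while an elongated domain favors splitting only the long dimension. A grid like `(16, 1)` leaves fewer processes active in the early steps of a pipelined schedule, which mirrors the trade-off described in the abstract: more execution steps in exchange for less communication per process.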
