A pipelined execution of tiled nested loops on SMPs with computation and communication overlapping

This paper proposes a novel approach for the parallel execution of tiled iteration spaces on a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory-mapped PCI-SCI network interface card. We apply a hyperplane-based grouping transformation to the tiled space, which groups independent neighboring tiles together and assigns each group to a single SMP node. Because the tiles within a group are independent, intranode (intragroup) communication is eliminated. Each group is executed atomically inside its node, and nodes exchange data between successive group computations. We schedule groups more efficiently by exploiting the overlap between the communication and computation phases of successive atomic group executions. The resulting non-blocking schedule resembles a pipelined datapath in which group computation phases overlap with communication phases instead of being interleaved with them. Our experimental results show that the proposed method outperforms previous approaches that use blocking communication or conventional grouping schemes.
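
The grouping and scheduling ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a 2D tiled space with unit tile dependencies `(1, 0)` and `(0, 1)`, takes the hyperplane `i + j = t` as the grouping direction, and uses a toy two-stage pipeline cost model. The names `DEPS`, `wavefronts`, `independent`, `blocking_time` and `overlapped_time` are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Assumed tile dependencies (illustrative): each tile needs data from the
# tile below it and the tile to its left in the tiled space.
DEPS = [(1, 0), (0, 1)]


def wavefronts(rows, cols):
    """Bucket the tiles of a rows x cols tiled space by hyperplane i + j = t."""
    fronts = {}
    for i, j in product(range(rows), range(cols)):
        fronts.setdefault(i + j, []).append((i, j))
    return [fronts[t] for t in sorted(fronts)]


def independent(tiles):
    """Return True if no tile in the list directly depends on another one."""
    tile_set = set(tiles)
    return all((i - di, j - dj) not in tile_set
               for i, j in tiles
               for di, dj in DEPS)


def blocking_time(steps, t_comp, t_comm):
    """Toy model: every step computes, then communicates, with no overlap."""
    return steps * (t_comp + t_comm)


def overlapped_time(steps, t_comp, t_comm):
    """Toy model: compute and communicate form a two-stage pipeline.

    The first step pays both phases; each later step adds only the
    slower of the two phases.
    """
    return t_comp + t_comm + (steps - 1) * max(t_comp, t_comm)


if __name__ == "__main__":
    fronts = wavefronts(4, 4)
    # Tiles on one hyperplane never depend on each other, so a group drawn
    # from a single hyperplane needs no communication between its tiles.
    print(all(independent(front) for front in fronts))
    print(blocking_time(len(fronts), 4, 3), overlapped_time(len(fronts), 4, 3))
```

Under these assumptions, tiles that share a hyperplane can run on different CPUs of one node without exchanging data, and the overlapped schedule is bounded by the slower phase per step rather than by the sum of both phases. Real costs depend on tile shape, group size, and the network interface, none of which this sketch models.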

[1]  Yves Robert,et al.  (Pen)-ultimate tiling? , 1994, Integr..

[2]  Nectarios Koziris,et al.  Evaluation of loop grouping methods based on orthogonal projection spaces , 2000, Proceedings 2000 International Conference on Parallel Processing.

[3]  Jang-Ping Sheu,et al.  Partitioning and Mapping Nested Loops on Multiprocessor Systems , 1991, IEEE Trans. Parallel Distributed Syst..

[4]  Jingling Xue,et al.  On Tiling as a Loop Transformation , 1997, Parallel Process. Lett..

[5]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[6]  Weijia Shang,et al.  On supernode transformation with minimized total running time , 1996, Proceedings of International Conference on Application Specific Systems, Architectures and Processors: ASAP '96.

[7]  Nectarios Koziris,et al.  Optimal Scheduling for UET/UET-UCT Generalized n-Dimensional Grid Task Graphs , 1999, J. Parallel Distributed Comput..

[8]  Chung-Ta King,et al.  Pipelined Data Parallel Algorithms-II: Design , 1990, IEEE Trans. Parallel Distributed Syst..

[9]  J. Ramanujam,et al.  Tiling Multidimensional Itertion Spaces for Multicomputers , 1992, J. Parallel Distributed Comput..

[10]  Donald J. Patterson,et al.  Computer organization and design: the hardware-software interface (appendix a , 1993 .

[11]  Nectarios Koziris,et al.  Enhancing the performance of tiled loop execution onto clusters using memory mapped network interfaces and pipelined schedules , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[12]  David A. Patterson,et al.  Computer Organization & Design: The Hardware/Software Interface , 1993 .

[13]  Nectarios Koziris,et al.  Minimizing completion time for loop tiling with computation and communication overlapping , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[14]  T. KingC.,et al.  Pipelined Data Parallel Algorithms-I , 1990 .

[15]  Nectarios Koziris,et al.  Chain Grouping: A Method for Partitioning Loops onto Mesh-Connected Processor Arrays , 2000, IEEE Trans. Parallel Distributed Syst..

[16]  Jingling Xue Communication-Minimal Tiling of Uniform Dependence Loops , 1997, J. Parallel Distributed Comput..