Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces

This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. We present a novel, pipelined scheduling approach which takes advantage of DMA communication mode, to send data to other nodes, while the CPUs are performing calculations. We also use zero-copy communication through pinned-down physical memory regions, provided by NIC’s driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a cluster of SMP nodes with single PCI-SCI NICs inside each node. In order to schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. Experimental evaluation illustrates that memory mapped NICs with enhanced communication features enable the use of a more advanced pipelined (overlapping) schedule, which considerably improves performance, compared to an ordinary blocking schedule, implemented with conventional, CPU and kernel bounded, communication primitives.

[1]  Corinne Ancourt,et al.  Scanning polyhedra with DO loops , 1991, PPOPP '91.

[2]  Hermann Hellwagner The SCI Standard and Applications of SCI , 1999, Scalable Coherent Interface.

[3]  Nectarios Koziris,et al.  Enhancing the performance of tiled loop execution onto clusters using memory mapped network interfaces and pipelined schedules , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[4]  Hiroshi Tezuka,et al.  The design and implementation of zero copy MPI using commodity hardware with a high performance network , 1998, ICS '98.

[5]  David A. Patterson,et al.  Computer Organization & Design: The Hardware/Software Interface , 1993 .

[6]  Nectarios Koziris,et al.  Optimal Scheduling for UET/UET-UCT Generalized n-Dimensional Grid Task Graphs , 1999, J. Parallel Distributed Comput..

[7]  Jingling Xue,et al.  On Tiling as a Loop Transformation , 1997, Parallel Process. Lett..

[8]  Hermann Hellwagner,et al.  SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Compute Clusters , 1999 .

[9]  Matthias A. Blumrich Network interface for protected, user-level communication , 1996 .

[10]  Nectarios Koziris,et al.  Minimizing completion time for loop tiling with computation and communication overlapping , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[11]  Nectarios Koziris,et al.  Automatic code generation for executing tiled nested loops onto parallel architectures , 2002, SAC '02.

[12]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[13]  Andrew A. Chien,et al.  Software overhead in messaging layers: where does the time go? , 1994, ASPLOS VI.

[14]  Knut Omang,et al.  VIA over SCI - consequences of a zero copy implementation, and comparison with VIA over myrinet , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[15]  Weijia Shang,et al.  On Supernode Transformation with Minimized Total Running Time , 1998, IEEE Trans. Parallel Distributed Syst..

[16]  Donald J. Patterson,et al.  Computer organization and design: the hardware-software interface (appendix a , 1993 .

[17]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.