Efficient Utilization of Memory Mapped NICs onto Clusters using Pipelined Schedules

This paper describes the performance benefits of using enhanced network interfaces to achieve low-latency communication. We use the DMA communication mode to send data to other nodes while the CPU performs useful computation. Zero-copy communication is achieved through pinned-down physical memory regions provided by the NIC's driver modules. Our testbed is the parallel execution of tiled nested loops on a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles exchange data with one another and must have a large computational grain for their parallel execution to be beneficial. We schedule tiles more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. The applied non-blocking schedule resembles a pipelined datapath in which computation phases are overlapped with communication phases instead of being interleaved with them. Experimental evaluation shows that enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms considerably improve overall performance compared to conventional, CPU- and kernel-bound communication primitives.
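The difference between an interleaved and an overlapped schedule can be illustrated with a rough cost model. The sketch below is not the paper's formulation: the function names, the per-tile times `t_comp` and `t_comm`, and the single unhidden communication phase that fills the pipeline are all simplifying assumptions made for illustration.

```python
def interleaved_time(num_tiles, t_comp, t_comm):
    """Blocking schedule: each tile's communication and computation
    run back to back on the CPU, so their costs add up."""
    return num_tiles * (t_comp + t_comm)


def overlapped_time(num_tiles, t_comp, t_comm):
    """Pipelined schedule (assumed model): the NIC moves data, e.g. by DMA,
    while the CPU computes, so each step costs only the longer phase.
    One communication phase is assumed to stay unhidden while the
    pipeline fills."""
    return t_comm + num_tiles * max(t_comp, t_comm)


if __name__ == "__main__":
    # Hypothetical figures: 100 tiles, 5 time units of computation
    # and 3 time units of communication per tile.
    print(interleaved_time(100, 5.0, 3.0))
    print(overlapped_time(100, 5.0, 3.0))
```

Under this model the gain is largest when the two phases are of similar length, since the shorter phase is then almost entirely hidden behind the longer one.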
