A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping

This paper proposes a new method for minimizing the execution time of nested for-loops using a tiling transformation. Our approach considers not only the tile size and shape dictated by the required communication-to-computation ratio, but also the overall completion time. We select a time hyperplane that executes tiles more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thereby taking the iteration space bounds into account. Our schedule considerably reduces overall completion time, under the assumption that some part of every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath in which computations are no longer interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We performed two sets of experiments, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced.
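The effect of overlapping can be illustrated with a toy timing model. The sketch below is not the schedule proposed in the paper. It assumes a linear chain of processors, each executing the same number of tiles in order, where tile k on processor p needs the message produced by tile k on processor p-1. The function names and the cost parameters `t_comp` and `t_comm` are hypothetical, and the model assumes that in the overlapped case the whole communication phase runs in the background (for example, on a DMA-capable network interface).

```python
def blocking_time(procs, tiles, t_comp, t_comm):
    """Completion time when the CPU performs communication itself,
    so every tile occupies the CPU for t_comp + t_comm.
    The result includes the send of the last tile (toy simplification)."""
    prev = [0] * tiles  # times at which the previous processor's messages arrive
    for p in range(procs):
        cpu_free = 0
        cur = []
        for k in range(tiles):
            ready = prev[k] if p > 0 else 0
            start = max(cpu_free, ready)
            cpu_free = start + t_comp + t_comm  # compute, then send, serially
            cur.append(cpu_free)
        prev = cur
    return prev[-1]


def overlapped_time(procs, tiles, t_comp, t_comm):
    """Completion time when communication runs in the background.
    The CPU only computes; a separate link serializes the messages."""
    arrived = [0] * tiles
    last_compute_end = 0
    for p in range(procs):
        cpu_free = 0   # when this processor's CPU becomes idle
        link_free = 0  # when this processor's outgoing link becomes idle
        sent = []
        for k in range(tiles):
            ready = arrived[k] if p > 0 else 0
            cpu_free = max(cpu_free, ready) + t_comp
            # The message for tile k starts once the tile is computed
            # and the link has finished the previous message.
            link_free = max(link_free, cpu_free) + t_comm
            sent.append(link_free)
        arrived = sent
        last_compute_end = cpu_free
    return last_compute_end


if __name__ == "__main__":
    print(blocking_time(4, 10, 1, 1))
    print(overlapped_time(4, 10, 1, 1))
```

With 4 processors, 10 tiles per processor, and unit computation and communication costs, this model gives 26 time units for the blocking case and 16 for the overlapped case. In steady state, the per-tile cost drops from `t_comp + t_comm` to roughly `max(t_comp, t_comm)`, which is the intuition behind the pipelined behavior described above.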
