Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs

This paper proposes a novel approach for the parallel execution of tiled iteration spaces on a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory-mapped PCI-SCI network interface card. We apply a hyperplane-based grouping transformation to the tiled space, which groups together independent neighboring tiles and assigns them to the same SMP node. In this way, intranode (intragroup) communication is eliminated. Each group is executed atomically inside its node, and nodes exchange data between successive group computations. We schedule groups more efficiently by exploiting the overlap that is inherently available between the communication and computation phases of successive atomic group executions. The resulting non-blocking schedule resembles a pipelined datapath: group computation phases are overlapped with communication phases instead of being interleaved with them. Our experimental results show that the proposed method outperforms previous approaches that use blocking communication or conventional grouping schemes.
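
The two ideas in the abstract can be illustrated with a small Python sketch. It is not the paper's transformation or cost model; it rests on assumptions made only for illustration: a 2-D tiled space with unit dependencies (1,0) and (0,1), hyperplanes of the form i + j = const, and a simple two-stage pipeline model for the overlapped schedule. All function and parameter names (`group_tiles`, `cpus_per_node`, `t_comp`, `t_comm`) are hypothetical.

```python
from collections import defaultdict


def group_tiles(n_i, n_j, cpus_per_node):
    """Group tiles of an n_i x n_j tiled space along hyperplanes i + j = const.

    Assumption: tile dependencies are (1, 0) and (0, 1), so tiles on the same
    hyperplane are mutually independent and can run on the CPUs of one node.
    Each group holds at most `cpus_per_node` neighboring tiles.
    """
    groups = defaultdict(list)
    for i in range(n_i):
        for j in range(n_j):
            wavefront = i + j               # hyperplane the tile lies on
            slot = i // cpus_per_node       # neighboring tiles share a slot
            groups[(wavefront, slot)].append((i, j))
    return dict(groups)


def depends_on(a, b):
    """True if tile a (transitively) depends on tile b under the assumed
    dependencies (1, 0) and (0, 1)."""
    return a != b and a[0] >= b[0] and a[1] >= b[1]


def blocking_time(steps, t_comp, t_comm):
    """Toy model: computation and communication are interleaved, so every
    group step pays for both phases in sequence."""
    return steps * (t_comp + t_comm)


def overlapped_time(steps, t_comp, t_comm):
    """Toy model: computation and communication form a two-stage pipeline,
    so after the first step the slower phase sets the pace."""
    if steps == 0:
        return 0
    return t_comp + t_comm + (steps - 1) * max(t_comp, t_comm)


if __name__ == "__main__":
    groups = group_tiles(4, 4, cpus_per_node=2)
    # No tile in a group depends on another tile of the same group.
    for tiles in groups.values():
        assert not any(depends_on(a, b) for a in tiles for b in tiles)
    print(blocking_time(10, 5, 3), overlapped_time(10, 5, 3))
```

Under these assumptions, 10 group steps with a computation cost of 5 and a communication cost of 3 take 80 time units when the phases are interleaved and 53 when they are overlapped. The gap widens as the two phase costs become more balanced, which is the intuition behind a pipelined schedule.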
