Supernode transformation on GPGPUs

ABSTRACT Supernode transformation, or tiling, is a technique that partitions an algorithm's iteration space to improve data locality and parallelism and thereby achieve the shortest running time. It groups multiple iterations of nested loops into supernodes, which are assigned to processors for parallel processing. A supernode transformation is described by a supernode size and shape. This paper focuses on supernode transformation on General Purpose Graphics Processing Units (GPGPUs), including supernode scheduling, the mapping of supernodes to GPGPU blocks, and the derivation of the optimal supernode size for achieving the shortest total running time. The algorithms considered are doubly nested loops with regular data dependencies; the Longest Common Subsequence (LCS) problem is used as an illustration. A novel mathematical model expresses the total running time as a function of the supernode size, algorithm parameters such as the problem size and the data dependences, the computation time of each loop iteration, architecture parameters such as the number of GPGPU blocks, and the communication cost. The optimal supernode size is derived from this closed-form model. The model and the optimal supernode size provide better results than previous research and are verified by simulations on GPGPUs.

Graphical Abstract: Iterations in the iteration space of a two-dimensional uniform dependence algorithm, shown as the intersections in the picture, can be grouped into rectangles, each known as a tile or supernode. This process is called supernode transformation, or tiling. It reduces the inter-iteration communication cost and thus improves the total execution time. Supernodes on the same wavefront may be scheduled on the GPU to be processed at the same time, each by one GPU block. The size of the tile plays an important role in this transformation; the optimal size leads to the minimal total execution time.
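To make the tiling and wavefront scheduling concrete, the sketch below computes the LCS dynamic-programming table in rectangular tiles processed in wavefront (anti-diagonal) order, mirroring the abstract's description. This is an illustrative sequential sketch, not the paper's GPU implementation: the function name `lcs_tiled` and the tile parameters `t1`, `t2` are our own, and on a GPGPU each tile of one wavefront would instead be handled concurrently by a separate block.

```python
def lcs_tiled(x, y, t1, t2):
    """Compute LCS length of strings x, y using t1 x t2 tiles
    processed in wavefront order (a sequential sketch)."""
    n, m = len(x), len(y)
    # DP table with an extra zero row and column as the boundary.
    L = [[0] * (m + 1) for _ in range(n + 1)]

    def compute_tile(i0, j0):
        # Fill one tile; it depends only on tiles above and to the left,
        # which earlier wavefronts have already completed.
        for i in range(i0, min(i0 + t1, n)):
            for j in range(j0, min(j0 + t2, m)):
                if x[i] == y[j]:
                    L[i + 1][j + 1] = L[i][j] + 1
                else:
                    L[i + 1][j + 1] = max(L[i][j + 1], L[i + 1][j])

    nt1 = (n + t1 - 1) // t1  # number of tiles along x
    nt2 = (m + t2 - 1) // t2  # number of tiles along y
    # Tiles with the same ti + tj lie on one wavefront and are
    # mutually independent, so a GPU could run them in parallel.
    for wave in range(nt1 + nt2 - 1):
        for ti in range(max(0, wave - nt2 + 1), min(nt1, wave + 1)):
            tj = wave - ti
            compute_tile(ti * t1, tj * t2)
    return L[n][m]
```

Because the tiles of a wavefront share no data among themselves, the result is independent of the tile size; the paper's contribution is choosing the size that balances per-tile computation against inter-tile communication on the GPU.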
