Tiling for Dynamic Scheduling

Tiling is a key transformation for coarsening the granularity of parallelism and improving locality. Current state-of-the-art compiler approaches for tiling affine loop nests rely on sufficient, i.e., conservative, conditions for the validity of tiling. These conservative conditions, which are geared toward static scheduling, miss tiling schemes whose tile schedule is not easy to describe statically. However, the partial order of the tiles can be expressed using dependence relations, which can then be used for dynamic scheduling at runtime. A further set of opportunities is missed for the classic reason that finding valid tiling hyperplanes is often harder than checking whether a given tiling is valid. Although the conservative validity conditions have worked in practice on a large number of codes, we show that they fail to find the desired tiling in several cases, some with dependence patterns similar to those of real-world problems and applications. We then look at ways to improve current techniques to address this issue. To quantify the potential of the improved techniques, we manually tile two dynamic programming algorithms, the Floyd-Warshall algorithm and Zuker's RNA secondary structure prediction, and report their performance on a shared-memory multicore. Our 3-d tiled, dynamically scheduled implementation of Zuker's algorithm outperforms GTfold, an optimized multicore implementation, by a factor of 2.38. Such a 3-d tiling was possible only by reasoning with more precise validity conditions.
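As one concrete illustration of the tile-level partial order described above, the following is a minimal sketch, in C, of the well-known blocked (three-phase) formulation of Floyd-Warshall. It is not the implementation evaluated in the paper; the names N, B, and fw_block are illustrative, and N is assumed to be a multiple of B. Within each phase of a round the tiles are mutually independent, while the dependences across phases form exactly the kind of partial order a runtime can exploit for dynamic scheduling.

#define N 8           /* matrix dimension (illustrative) */
#define B 4           /* tile size; assumes N % B == 0 */

static double D[N][N];   /* distance matrix, initialized elsewhere */

static double fw_min(double a, double b) { return a < b ? a : b; }

/* Floyd-Warshall update of tile (ib, jb) using the k-range of tile kb. */
static void fw_block(int ib, int jb, int kb)
{
    for (int k = kb * B; k < (kb + 1) * B; k++)
        for (int i = ib * B; i < (ib + 1) * B; i++)
            for (int j = jb * B; j < (jb + 1) * B; j++)
                D[i][j] = fw_min(D[i][j], D[i][k] + D[k][j]);
}

void blocked_floyd_warshall(void)
{
    int T = N / B;                      /* tiles per dimension */
    for (int kb = 0; kb < T; kb++) {
        /* Phase 1: the diagonal tile of round kb depends only on itself. */
        fw_block(kb, kb, kb);

        /* Phase 2: tiles in row kb and column kb depend only on the
         * phase-1 tile; they are mutually independent, so a dynamic
         * scheduler may start each one as soon as (kb, kb) completes. */
        for (int jb = 0; jb < T; jb++)
            if (jb != kb) fw_block(kb, jb, kb);
        for (int ib = 0; ib < T; ib++)
            if (ib != kb) fw_block(ib, kb, kb);

        /* Phase 3: every remaining tile (ib, jb) depends only on the
         * phase-2 tiles (ib, kb) and (kb, jb) of this round; these
         * updates are again mutually independent. */
        for (int ib = 0; ib < T; ib++)
            for (int jb = 0; jb < T; jb++)
                if (ib != kb && jb != kb) fw_block(ib, jb, kb);
    }
}

Under this tile-level dependence relation, a task-based runtime could release each phase-3 tile (ib, jb) as soon as tiles (ib, kb) and (kb, jb) of the same round finish, rather than inserting a barrier between phases; the phased loop structure above is only one statically describable schedule of that partial order.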
