Symbolic parallelization of loop programs for massively parallel processor arrays

In this paper, we present a first solution to the previously unsolved problem of jointly tiling and scheduling a given loop nest with uniform data dependencies symbolically. This problem arises for loop programs whose iterations are to be scheduled optimally on a processor array whose size is unknown at compile time. We show that it is nevertheless possible to derive parameterized latency-optimal schedules statically by proposing two new program transformations. In the first step, the iteration space is tiled symbolically into orthotopes of parameterized extents. The resulting tiled program is subsequently scheduled symbolically. Here, we show that the number of potentially optimal schedules is bounded above by $2^n n!$, where $n$ is the dimension of the loop nest; the actual number of optimal schedule candidates, however, is much smaller than this. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions determine which of these schedules is activated dynamically, and the corresponding program configuration is executed on the resulting processor array, so as to avoid any further run-time optimization or expensive recompilation.
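The run-time selection step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `candidate_schedules`, `upper_bound`, and `select_schedule` are hypothetical, and so are the two latency expressions in the usage example. The enumeration assumes one plausible reading of the bound, namely n! orderings of the loop dimensions combined with 2^n choices of traversal direction; the paper's own derivation may differ.

```python
from itertools import permutations, product
from math import factorial


def candidate_schedules(n):
    """Yield (order, directions) pairs for an n-dimensional loop nest.

    Assumption for illustration: a candidate is a permutation of the
    n loop dimensions together with a +1/-1 direction per dimension.
    """
    for order in permutations(range(n)):
        for directions in product((+1, -1), repeat=n):
            yield order, directions


def upper_bound(n):
    """Upper bound 2^n * n! on the number of candidate schedules."""
    return 2 ** n * factorial(n)


def select_schedule(latency_exprs, array_size):
    """Return the name of the candidate with the smallest latency.

    latency_exprs maps a candidate name to a function that evaluates
    its (compile-time derived) latency expression for a concrete
    processor array size known only at run time.
    """
    return min(latency_exprs, key=lambda name: latency_exprs[name](array_size))


if __name__ == "__main__":
    # Hypothetical latency expressions for a 2-D array of p[0] x p[1] processors.
    exprs = {
        "A": lambda p: 100 // p[0] + p[1],
        "B": lambda p: 100 // p[1] + p[0],
    }
    print(upper_bound(3))                  # size of the candidate set for n = 3
    print(select_schedule(exprs, (2, 10)))  # chosen once the array size is known
```

In this sketch all candidates and their latency expressions are fixed ahead of time, so the only work left at run time is evaluating and comparing a handful of closed-form expressions.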
