Symbolic loop parallelization for balancing I/O and memory accesses on processor arrays

Loop parallelization techniques for massively parallel processor arrays that use one-level tiling are often either I/O-bound or memory-bound and thus exceed the target architecture's capabilities. Furthermore, if the number of available processing elements is known only at runtime, as in adaptive systems, static approaches fail. To solve these problems, we present a hybrid compile/runtime technique to symbolically parallelize loop nests with uniform dependences on multiple levels. At compile time, two novel transformations are performed: (a) symbolic hierarchical tiling, followed by (b) symbolic multi-level scheduling. By tuning the tile sizes on multiple levels, the required I/O bandwidth can be traded off against the required memory, which makes it easier to satisfy resource constraints. The resulting schedules are symbolic with respect to the number of tiles; thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, once the number of processing elements is known, a simple prolog chooses a schedule that is feasible with respect to the I/O and memory constraints and latency-optimal for the chosen tile size. In this way, our approach dynamically chooses latency-optimal and feasible schedules while avoiding expensive re-compilations.
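
The following Python sketch illustrates the two ideas in a deliberately simplified one-dimensional setting; it is not the formulation used in the paper. The function names (`decompose`, `recompose`, `choose_tiling`) and, in particular, the memory, I/O, and latency formulas are assumptions invented for this illustration. They only mimic the qualitative trade-off described above: larger tiles need more local memory but less I/O bandwidth.

```python
def decompose(i, p1, p2):
    """Split a 1-D iteration index i into two-level tile coordinates.

    j  : position inside a level-1 tile of size p1
    k1 : index of that level-1 tile inside a level-2 tile (p2 tiles each)
    k2 : index of the level-2 tile
    """
    j = i % p1
    k1 = (i // p1) % p2
    k2 = i // (p1 * p2)
    return j, k1, k2


def recompose(j, k1, k2, p1, p2):
    """Inverse of decompose: rebuild the original iteration index."""
    return j + p1 * k1 + p1 * p2 * k2


def choose_tiling(n, num_pes, mem_per_pe, io_budget):
    """Toy runtime 'prolog': pick a level-1 tile size p1 for n iterations.

    HYPOTHETICAL cost model (assumed for illustration only):
      memory per PE  = p1 words           (each PE buffers one tile)
      I/O bandwidth  = num_pes / p1       (one word per PE per tile)
      latency        = ceil(n / (p1 * num_pes)) * p1 cycles
    Returns (p1, latency) for the fastest feasible choice, or None if
    no tile size satisfies both constraints.
    """
    best = None
    for p1 in range(1, n + 1):
        mem_needed = p1
        io_needed = num_pes / p1
        if mem_needed > mem_per_pe or io_needed > io_budget:
            continue  # violates a resource constraint, so not feasible
        rounds = -(-n // (p1 * num_pes))  # integer ceiling division
        latency = rounds * p1
        if best is None or latency < best[1]:
            best = (p1, latency)
    return best


if __name__ == "__main__":
    # The number of PEs is supplied only here, at "runtime".
    print(choose_tiling(n=64, num_pes=4, mem_per_pe=8, io_budget=1.0))
```

In this sketch the candidate tile sizes are enumerated at runtime for simplicity; the approach described above instead derives the symbolic schedules at compile time, so that the runtime step only has to select among them.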
