Symbolic inner loop parallelisation for massively parallel processor arrays

This paper presents a first solution to the previously unsolved problem of symbolically scheduling a loop nest with uniform data dependences using inner loop parallelization, specifically the locally parallel, globally sequential (LPGS) mapping technique. This technique is needed when the iterations of a loop program are to be scheduled on a processor array whose size is unknown at compile time, while keeping local memory consumption independent of the problem size of the mapped loop nest. We show that such parameterized LPGS schedules can be derived statically by means of a mixed compile-time/runtime approach. At compile time, we first determine the set of all schedule candidates, each latency-optimal for a different scanning order of the loop nest. We then derive an exact parameterized formula for the latency of the resulting symbolic schedules, which makes each schedule fully predictable. At runtime, once the size of the processor array becomes known, a simple prolog selects the overall latency-optimal schedule, which is then dynamically activated and executed on the processor array. Our approach thus avoids any further runtime optimization and expensive recompilation while achieving the same results as computing an optimal static schedule for each possible combination of array and problem size.
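The runtime selection step can be illustrated with a minimal sketch. It assumes a two-dimensional loop nest of size n1 x n2 mapped onto a p1 x p2 processor array. The names `Candidate`, `CANDIDATES`, and `select_schedule`, as well as the two latency expressions, are hypothetical placeholders chosen for illustration; they are not the parameterized latency formula derived in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def tiles(n: int, p: int) -> int:
    """Number of tiles needed to cover n iterations with p processors (ceiling division)."""
    return -(-n // p)


@dataclass(frozen=True)
class Candidate:
    """One schedule candidate produced at compile time (hypothetical structure)."""
    name: str
    # Closed-form latency in the symbolic parameters (n1, n2, p1, p2).
    latency: Callable[[int, int, int, int], int]


# Assumption: placeholder latency expressions, one per scanning order.
# They only mimic the idea that each candidate is best for some array shapes.
CANDIDATES = (
    Candidate("scan_dim1_first",
              lambda n1, n2, p1, p2: tiles(n1, p1) * tiles(n2, p2) + p1 + 2 * p2),
    Candidate("scan_dim2_first",
              lambda n1, n2, p1, p2: tiles(n1, p1) * tiles(n2, p2) + 2 * p1 + p2),
)


def select_schedule(candidates: Sequence[Candidate],
                    n1: int, n2: int, p1: int, p2: int) -> Candidate:
    """Runtime prolog: evaluate each closed-form latency and keep the minimum."""
    return min(candidates, key=lambda c: c.latency(n1, n2, p1, p2))


if __name__ == "__main__":
    # The array size becomes known only at runtime.
    best = select_schedule(CANDIDATES, n1=64, n2=64, p1=2, p2=8)
    print(best.name, best.latency(64, 64, 2, 8))
```

The point of the sketch is that the prolog only evaluates a handful of precomputed expressions, so no optimization problem has to be solved and nothing has to be recompiled once the array size is known.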
