Symbolic Mapping of Loop Programs onto Processor Arrays

In this paper, we present a solution to the problem of joint tiling and scheduling a given loop nest with uniform data dependencies symbolically. This challenge arises when the size and number of available processors for parallel loop execution is not known at compile time. But still, in order to avoid any overhead of dynamic (run-time) recompilation, a schedule of loop iterations shall be computed and optimized statically. In this paper, it will be shown that it is possible to derive parameterized latency-optimal schedules statically by proposing a two step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parametrized extensions. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be dynamically selected and the corresponding program configuration executed on the resulting processor array so to avoid any further run-time optimization or expensive recompilation. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate our proposed methodology for a massively parallel processor array architecture called tightly coupled processor array (TCPA) on which applications may dynamically claim regions of processors in the context of invasive computing.

[1]  J. Ramanujam,et al.  Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.

[2]  G.E. Moore,et al.  Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.

[3]  Sanjay V. Rajopadhye,et al.  Parameterized loop tiling , 2012, TOPL.

[4]  Frédéric Vivien,et al.  A constructive solution to the juggling problem in processor array synthesis , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[5]  Yves Robert,et al.  Linear scheduling is close to optimality , 1992, [1992] Proceedings of the International Conference on Application Specific Array Processors.

[6]  Narayanan Vijaykrishnan,et al.  Run-time adaption for highly-complex multi-core systems , 2013, 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[7]  Jürgen Teich,et al.  Mapping a class of dependence algorithms to coarse-grained reconfigurable arrays: architectural parameters and methodology , 2006, Int. J. Embed. Syst..

[8]  Frank Hannig,et al.  Invasive Tightly-Coupled Processor Arrays , 2014, ACM Trans. Embed. Comput. Syst..

[9]  Jürgen Teich,et al.  Exact Partitioning of Affine Dependence Algorithms , 2002, Embedded Processor Design Challenges.

[10]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[11]  Jürgen Teich,et al.  Partitioning Processor Arrays under Resource Constraints , 1997, J. VLSI Signal Process..

[12]  Jürgen Teich,et al.  Decentralized dynamic resource management support for massively parallel processor arrays , 2011, ASAP 2011 - 22nd IEEE International Conference on Application-specific Systems, Architectures and Processors.

[13]  Jürgen Teich,et al.  Scalable Many-Domain Power Gating in Coarse-Grained Reconfigurable Processor Arrays , 2011, IEEE Embedded Systems Letters.

[14]  B. Ramakrishna Rau,et al.  A Constructive Solution to the Juggling Problem in Systolic Array Synthesis , 2000 .

[15]  Oscar H. Ibarra,et al.  On symbolic scheduling and parallel complexity of loops , 1995, Proceedings.Seventh IEEE Symposium on Parallel and Distributed Processing.

[16]  J. Ramanujam,et al.  DynTile: Parametric tiled loop generation for parallel execution on multicore processors , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[17]  Jürgen Teich,et al.  A highly parameterizable parallel processor array architecture , 2006, 2006 IEEE International Conference on Field Programmable Technology.

[18]  Jingling Xue,et al.  Model-Driven Tile Size Selection for DOACROSS Loops on GPUs , 2011, Euro-Par.

[19]  Yves Robert,et al.  Affine-by-Statement Scheduling of Uniform and Affine Loop Nests over Parametric , 1995, J. Parallel Distributed Comput..

[20]  Larry Carter,et al.  Selecting tile shape for minimal execution time , 1999, SPAA '99.

[21]  Jürgen Teich,et al.  Towards Symbolic Run-Time Reconfiguration in Tightly-Coupled Processor Arrays , 2011, 2011 International Conference on Reconfigurable Computing and FPGAs.

[22]  Jürgen Teich,et al.  Hierarchical power management for adaptive tightly-coupled processor arrays , 2013, TODE.

[23]  Jürgen Teich,et al.  Distributed Resource Reservation in Massively Parallel Processor Arrays , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[24]  Lothar Thiele,et al.  On the design of piecewise regular processor arrays , 1989, IEEE International Symposium on Circuits and Systems,.

[25]  Jürgen Teich,et al.  System integration of tightly-coupled processor arrays using reconfigurable buffer structures , 2013, CF '13.

[26]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[27]  Sanjay V. Rajopadhye,et al.  Parameterized tiled loops for free , 2007, PLDI '07.

[28]  Jürgen Teich,et al.  Resource-aware programming and simulation of MPSoC architectures through extension of X10 , 2011, SCOPES.

[29]  S. Mahlke,et al.  Multicore compilation strategies and challenges , 2009, IEEE Signal Processing Magazine.

[30]  Jürgen Teich,et al.  Scheduling of partitioned regular algorithms on processor arrays with constrained resources , 1996, Proceedings of International Conference on Application Specific Systems, Architectures and Processors: ASAP '96.

[31]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[32]  I. Radivojevic,et al.  Symbolic Scheduling Techniques , 1995, IEICE Trans. Inf. Syst..

[33]  Jingling Xue,et al.  Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.

[34]  Jürgen Teich,et al.  Invasive computing - Concepts and overheads , 2012, Proceeding of the 2012 Forum on Specification and Design Languages.

[35]  Frank Hannig,et al.  Scheduling Techniques for High-Throughput Loop Accelerators , 2009 .

[36]  Jürgen Teich,et al.  Invasive Algorithms and Architectures Invasive Algorithmen und Architekturen , 2008, it Inf. Technol..

[37]  Thomas Kailath,et al.  Regular iterative algorithms and their implementation on processor arrays , 1988, Proc. IEEE.

[38]  Jürgen Teich,et al.  Invasive Computing: An Overview , 2011, Multiprocessor System-on-Chip.

[39]  Jürgen Teich,et al.  Symbolic parallelization of loop programs for massively parallel processor arrays , 2013, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors.

[40]  Jürgen Teich,et al.  Loop program mapping and compact code generation for programmable hardware accelerators , 2013, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors.

[41]  Sriram Krishnamoorthy,et al.  Parametric multi-level tiling of imperfectly nested loops , 2009, ICS.

[42]  Weijia Shang,et al.  Time Optimal Linear Schedules for Algorithms with Uniform Dependencies , 1991, IEEE Trans. Computers.

[43]  Leslie Lamport,et al.  The parallel execution of DO loops , 1974, CACM.

[44]  Jürgen Teich,et al.  PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications , 2008, ARC.

[45]  Jingling Xue,et al.  Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs , 2012, 2012 41st International Conference on Parallel Processing.