Polyhedral fragments: an efficient representation for symbolically generating code for processor arrays

To leverage the vast parallelism of loops, embedded loop accelerators often take the form of processor arrays with many but simple processing elements. Each processing element executes a subset of a loop's iterations in parallel, exploiting instruction- and data-level parallelism: iterations are tightly scheduled using software pipelining, and instructions are packed into compact, individual programs. However, loop bounds are often unknown until runtime, which complicates the static generation of programs because the bounds influence each program's control flow. Existing solutions, such as generating and storing all possible programs or performing full just-in-time compilation, are prohibitively expensive, especially in embedded systems. As a remedy, we propose a hybrid approach that introduces a tree-like program representation, whose generation front-loads all intractable sub-problems to compile time, and from which all concrete program variants can be stitched together efficiently at runtime. The tree consists of so-called polyhedral fragments, which represent concrete program parts and are annotated with iteration-dependent conditions. We show that this representation is both space- and time-efficient: it requires polynomial space to store (whereas storing all possible program variants requires non-polynomial space) and polynomial time to evaluate (whereas just-in-time compilation requires solving NP-hard problems). In a case study, we show for a representative loop program that using a tree of polyhedral fragments saves 98.88 % of space compared to storing all program variants.
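
The idea of stitching a concrete program from a tree of condition-annotated fragments can be illustrated with a small sketch. This is not the representation defined in the paper: the `Fragment` class, the `stitch` function, the parameter name `N`, and the instruction labels are all hypothetical, and the conditions here test a single runtime loop bound rather than genuine iteration-dependent polyhedral conditions. The sketch only shows how a tree built once ahead of time might be evaluated for different runtime parameters with a simple traversal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, int]  # runtime parameters, e.g. loop bounds (hypothetical)


def always(params: Params) -> bool:
    """Default condition: the fragment is part of every program variant."""
    return True


@dataclass
class Fragment:
    """Hypothetical tree node: a program part guarded by a condition."""
    code: List[str]
    condition: Callable[[Params], bool] = always
    children: List["Fragment"] = field(default_factory=list)


def stitch(node: Fragment, params: Params) -> List[str]:
    """Concatenate the code of all fragments whose conditions hold.

    Each node is visited at most once, so the cost is linear in the
    size of the tree for this simplified model.
    """
    if not node.condition(params):
        return []  # prune this fragment and its whole subtree
    program = list(node.code)
    for child in node.children:
        program.extend(stitch(child, params))
    return program


# Built once "at compile time"; the labels stand in for real instructions.
tree = Fragment(
    code=["init"],
    children=[
        Fragment(["prologue"], lambda p: p["N"] >= 2),
        Fragment(["kernel"], lambda p: p["N"] >= 3),
        Fragment(["epilogue"], lambda p: p["N"] >= 2),
        Fragment(["single_iteration"], lambda p: p["N"] == 1),
    ],
)

# Evaluated "at runtime", once the loop bound N is known.
for n in (1, 2, 4):
    print(n, stitch(tree, {"N": n}))
```

In this toy model, one tree stands in for three distinct program variants, which hints at why storing the tree could be cheaper than storing every variant separately.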
