Compact Code Generation for Tightly-Coupled Processor Arrays

In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small, lightweight VLIW cores, which can exploit both loop-level parallelism and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications, often providing higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for overall system-on-chip design. In this context, we present a novel design methodology for the mapping of nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model, and (2) support of zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce area and power consumption. Hence, (3) we present methods for code compaction and code generation, and integrate these methods into a design tool. Finally, (4) we evaluate selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor codes by up to 64 % (Trimaran) and 55 % (VEX) while at the same time achieving a significantly higher throughput.
