Parallelization Approaches for Hardware Accelerators - Loop Unrolling Versus Loop Partitioning

State-of-the-art behavioral synthesis tools offer few high-level transformations for achieving highly parallelized implementations. If they offer any, they apply loop unrolling to obtain a higher throughput. In this paper, we employ the PARO behavioral synthesis tool, which has the unique ability to perform both loop unrolling and loop partitioning. Loop unrolling replicates the loop kernel and exposes the parallelism for hardware implementation, whereas partitioning tiles the loop program onto a regular array of tightly coupled processing elements. Using the same design tool for both variants enables, for the first time, a quantitative evaluation of the two approaches for reconfigurable architectures, with the help of computationally intensive algorithms selected from different benchmarks. Superlinear speedups in terms of throughput are achieved for the processor array approach. In addition, area and power costs are reduced.
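To make the distinction between the two transformations concrete, the following is a minimal software sketch in Python, not PARO's actual input language or generated hardware. It assumes a small FIR-like loop nest as the example kernel; the function names, the unroll factor of 2, and the `num_pes` parameter are hypothetical choices made for illustration only. The unrolled variant replicates the kernel body inside one loop, while the partitioned variant splits the iteration space into tiles and assigns each iteration within a tile to its own processing element (PE).

```python
def fir_reference(x, w):
    """Original loop nest: y[i] = sum over j of w[j] * x[i + j]."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    for i in range(n_out):
        for j in range(len(w)):
            y[i] += w[j] * x[i + j]
    return y


def fir_unrolled(x, w):
    """Outer loop unrolled by a factor of 2 (illustrative factor).

    The kernel statement appears twice in the loop body; in hardware,
    each copy would map to its own functional units.
    """
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    main = n_out - (n_out % 2)
    for i in range(0, main, 2):
        for j in range(len(w)):
            y[i] += w[j] * x[i + j]          # kernel copy 0
            y[i + 1] += w[j] * x[i + 1 + j]  # kernel copy 1
    for i in range(main, n_out):             # leftover iteration, if any
        for j in range(len(w)):
            y[i] += w[j] * x[i + j]
    return y


def fir_partitioned(x, w, num_pes=2):
    """Iteration space tiled onto num_pes processing elements.

    Iteration i is decomposed as i = t * num_pes + p. The tile index t
    advances sequentially; the PE index p would run concurrently in a
    processor array (simulated here by an ordinary loop). Each PE keeps
    a local accumulator.
    """
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    num_tiles = (n_out + num_pes - 1) // num_pes
    for t in range(num_tiles):        # tiles, processed one after another
        for p in range(num_pes):      # PEs, parallel in hardware
            i = t * num_pes + p
            if i >= n_out:
                continue              # partial last tile
            acc = 0                   # PE-local accumulator
            for j in range(len(w)):
                acc += w[j] * x[i + j]
            y[i] = acc
    return y
```

All three functions compute the same result; they differ only in how the iterations are organized, which is what determines the structure of the resulting hardware. The sketch says nothing about throughput, area, or power, which depend on the synthesized implementation.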
