Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests

Modern high-level synthesis (HLS) tools commonly employ pipelining to accelerate loops efficiently by overlapping the execution of successive loop iterations. While existing HLS pipelining techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current pipelining techniques require the inner loops to be unrolled, which is either very expensive in resources or inapplicable altogether because the loop bounds are not known at compile time. To address this major limitation, this paper proposes ElasticFlow, a novel architecture capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in an area-efficient manner. The proposed LPUs can be either specialized to execute an individual inner loop or shared among multiple inner loops to balance the tradeoff between performance and area. A customized banked memory architecture is proposed to coordinate memory accesses among different LPUs to maximize memory bandwidth without significantly increasing memory footprint. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a state-of-the-art commercial HLS tool for Xilinx FPGAs.
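To make the term "irregular loop nest" concrete, the sketch below shows a sparse matrix-vector product in CSR form, where the inner loop's trip count depends on the current row and so cannot be fully unrolled at compile time. It is followed by a toy cycle model of handing each inner loop to the earliest-free processing unit. Both functions are illustrative assumptions and are not taken from the paper: the names `spmv_csr` and `dispatch_makespan`, the one-iteration-per-cycle cost, and the in-order greedy dispatch policy are all hypothetical simplifications.

```python
import heapq

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix; the inner loop bound depends on the row."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):                # outer loop: fixed bound
        for k in range(row_ptr[i], row_ptr[i + 1]):  # inner loop: dynamic bound
            y[i] += vals[k] * x[col_idx[k]]
    return y

def dispatch_makespan(trip_counts, num_lpus):
    """Toy model (assumption): each inner loop goes, in program order, to the
    unit that becomes free first. One inner iteration costs one cycle and
    memory stalls are ignored. Returns the cycle at which the last unit ends."""
    free_at = [0] * num_lpus          # min-heap of per-unit "free at" cycles
    heapq.heapify(free_at)
    for n in trip_counts:
        start = heapq.heappop(free_at)      # earliest-free unit
        heapq.heappush(free_at, start + n)  # busy for n cycles
    return max(free_at)
```

With rows of uneven length, the model shows why distributing whole inner loops across a small number of units can shorten execution without unrolling: the finish time is bounded by the busiest unit rather than by the sum of all trip counts.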
