Using buffer-to-BRAM mapping approaches to trade-off throughput vs. memory use

One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among many buffers in an application. To achieve desired performance and meet the on-chip memory budget requirements, the designer faces the burden of manually assigning application buffers to physical on-chip memories. Mismatches between dimensions (bit-width and depth) of buffers and physical on-chip memories lead to underutilized memories. Memory utilization can be increased via buffer packing - grouping buffers together and implementing them as a single memory, at the expense of data throughput. However, identifying buffer groups that result in the least amount of physical memory is a combinatorial problem with a large search space. This process is time consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. Previous work [1] introduced a tool that provides high-level pragmas allowing the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. This paper extends the previous work by introducing two low-level pragmas that specify information about memory access patterns, resulting in an improved on-chip memory utilization up to 22%. Further, we develop a simulated annealing based buffer packing algorithm, which reduces the tool's run-time from over 30 mins down to 15 sec, with an improvement in performance in the generated memory solution. Finally, we demonstrate the effectiveness of our tool with four stream application benchmarks.

[1]  Implementing FPGA Design with the OpenCL Standard , 2010 .

[2]  Jason Cong,et al.  Automatic memory partitioning and scheduling for throughput and power optimization , 1999, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[3]  Jonathan Rose,et al.  Definition and solution of the memory packing problem for field-programmable systems , 1994, ICCAD.

[4]  Frédo Durand,et al.  Decoupling algorithms from schedules for easy optimization of image processing pipelines , 2012, ACM Trans. Graph..

[5]  Tarek S. Abdelrahman,et al.  hiCUDA: High-Level GPGPU Programming , 2011, IEEE Transactions on Parallel and Distributed Systems.

[6]  Andy Gean Ye,et al.  Analysis and architecture design of scalable fractional motion estimation for H.264 encoding , 2012, Integr..

[7]  Edward H. Adelson,et al.  PYRAMID METHODS IN IMAGE PROCESSING. , 1984 .

[8]  David Pellerin,et al.  Practical FPGA programming in C , 2005 .

[9]  Pedro C. Diniz,et al.  A compiler approach to managing storage and memory bandwidth in configurable architectures , 2008, TODE.

[10]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[11]  Francky Catthoor,et al.  Incremental hierarchical memory size estimation for steering of loop transformations , 2007, TODE.

[12]  Jason Cong,et al.  Optimizing memory hierarchy allocation with loop transformations for high-level synthesis , 2012, DAC Design Automation Conference 2012.

[13]  Gang Wang,et al.  Storage assignment during high-level synthesis for configurable architectures , 2005, ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005..

[14]  Herman Schmit,et al.  Synthesis of application-specific memory designs , 1997, IEEE Trans. Very Large Scale Integr. Syst..

[15]  Nikil D. Dutt,et al.  Local memory exploration and optimization in embedded systems , 1999, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[16]  Jason Helge Anderson,et al.  LegUp: high-level synthesis for FPGA-based processor/accelerator systems , 2011, FPGA '11.

[17]  Jason Cong,et al.  Polyhedral-based data reuse optimization for configurable computing , 2013, FPGA '13.