Custom Multicache Architectures for Heap Manipulating Programs

Memory-intensive implementations often require access to an external, off-chip memory, which can substantially slow down a field-programmable gate array (FPGA) accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem, but optimizing the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that automatically generates parallel multicache systems tailored to the specific requirements of the application. Our program analysis identifies nonoverlapping memory regions, which are supported by private caches, and regions that are shared by parallel units after parallelization, which are supported by coherent caches and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this paper is twofold. First, we focus on programs using dynamically allocated, pointer-based data structures, which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. Second, we devise a high-level cache performance estimation to find a heterogeneous configuration of cache sizes that maximizes the performance of the multicache system subject to an on-chip memory resource constraint. We demonstrate our technique with three case studies of applications using dynamic data structures and use Xilinx Vivado HLS as an exemplary HLS tool. We show up to $15\times$ speed-up after parallelization of the HLS implementations and the insertion of the application-specific distributed hybrid multicache architecture.
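The cache-sizing step described above, choosing one size per cache so that overall performance is maximized within an on-chip memory budget, can be illustrated with a small dynamic program. This is only a minimal sketch under stated assumptions and not the paper's actual algorithm: the function name `choose_cache_sizes`, the per-cache cost tables, and the unit of "memory blocks" are all hypothetical, and the cost values stand in for whatever a high-level performance estimate would produce.

```python
from typing import Dict, List, Tuple


def choose_cache_sizes(
    est_cost: List[Dict[int, float]],
    budget: int,
) -> Tuple[float, List[int]]:
    """Pick one size per cache, minimizing total estimated cost.

    est_cost[i][s] is a hypothetical estimated cost (e.g. miss cycles)
    of cache i when it is given s memory blocks. The sum of the chosen
    sizes must not exceed `budget`. Returns (total_cost, sizes); the
    cost is infinity if no feasible assignment exists.
    """
    inf = float("inf")
    # best[b] = (cost, sizes) for the caches seen so far using exactly b blocks
    best: List[Tuple[float, List[int]]] = [(inf, [])] * (budget + 1)
    best[0] = (0.0, [])

    for options in est_cost:
        nxt: List[Tuple[float, List[int]]] = [(inf, [])] * (budget + 1)
        for used, (cost, sizes) in enumerate(best):
            if cost == inf:
                continue  # this budget level is unreachable so far
            for size, c in options.items():
                b = used + size
                if b <= budget and cost + c < nxt[b][0]:
                    nxt[b] = (cost + c, sizes + [size])
        best = nxt

    # The cheapest assignment over all budget levels within the limit.
    return min(best, key=lambda entry: entry[0])


# Illustrative, made-up cost tables for two caches.
costs = [
    {1: 100, 2: 60, 4: 30, 8: 25},  # cache 0 benefits strongly from size
    {1: 50, 2: 20, 4: 18},          # cache 1 saturates early
]
total, sizes = choose_cache_sizes(costs, budget=6)
print(total, sizes)
```

With these made-up numbers, the sketch gives the larger share of the budget to the cache whose estimated cost drops most with size, which is the intuition behind a heterogeneous configuration.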
