Acceleration by Inline Cache for Memory-Intensive Algorithms on FPGA via High-Level Synthesis

FPGA-based acceleration of high-performance computing (HPC) applications is becoming an attractive way to reduce energy and power consumption, thanks to the availability of high-level synthesis (HLS) tools that enable fast design cycles. However, obtaining good performance for memory-intensive algorithms, which often exchange large data arrays with external DRAM, still requires time-consuming optimization and good knowledge of hardware design. This article proposes a new design methodology based on dedicated caches, each specific to an application and to a data array. These caches provide most of the benefits that can be achieved by hand-coding optimized DMA-like transfer strategies into the HPC application code, but they require only limited manual tuning (essentially the selection of cache architecture and size), are independent of the target HLS tool and technology (FPGA or ASIC), and do not require changes to the application code. We show experimental results obtained on five common memory-intensive algorithms from diverse domains, namely machine learning, data sorting, and computer vision. We compare the cost and performance of our caches against both out-of-the-box code originally optimized for a GPU and manually optimized implementations specifically targeted at FPGAs via HLS. On average, the implementations using our caches achieved an 8X speedup and a 2X energy reduction with respect to out-of-the-box models that use only simple directive-based optimizations (e.g., pipelining). They also achieved performance comparable to that of the versions manually optimized for efficient memory transfers on an FPGA, with much less design effort.
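The idea of an array-specific cache that sits "inline" between a kernel and external memory can be illustrated with a small C++ sketch. This is not the authors' implementation: the class name `InlineCache`, its template parameters, the read-only direct-mapped organization, and the example kernel `sum_neighbors` are all assumptions made for illustration, and the array length is assumed to be a multiple of the line size. The sketch only shows how overloading `operator[]` could let unmodified kernel code read through a cache whose size is chosen at compile time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical read-only, direct-mapped cache in front of ONE array that
// lives in external memory. NumLines and WordsPerLine are the tuning knobs.
// Assumption: the array length is a multiple of WordsPerLine.
template <typename T, std::size_t NumLines, std::size_t WordsPerLine>
class InlineCache {
public:
    explicit InlineCache(const T* dram) : dram_(dram) {
        assert(dram != nullptr);
    }

    // Same syntax as a plain array read, so kernel code need not change.
    T operator[](std::size_t addr) {
        const std::size_t block  = addr / WordsPerLine;  // memory block index
        const std::size_t line   = block % NumLines;     // direct-mapped slot
        const std::size_t offset = addr % WordsPerLine;  // word within line
        if (!valid_[line] || tag_[line] != block) {
            // Miss: fetch the whole line with consecutive (burst-like) reads.
            for (std::size_t i = 0; i < WordsPerLine; ++i) {
                data_[line][i] = dram_[block * WordsPerLine + i];
            }
            tag_[line]   = block;
            valid_[line] = true;
            ++misses_;
        } else {
            ++hits_;
        }
        return data_[line][offset];
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    const T*    dram_;
    T           data_[NumLines][WordsPerLine] = {};
    std::size_t tag_[NumLines] = {};
    bool        valid_[NumLines] = {};
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};

// Hypothetical memory-intensive kernel: a 3-point stencil sum. It works on
// anything indexable with [], so a raw array and a cache are interchangeable.
template <typename Array>
long long sum_neighbors(Array& a, std::size_t n) {
    long long acc = 0;
    for (std::size_t i = 1; i + 1 < n; ++i) {
        acc += a[i - 1] + a[i] + a[i + 1];
    }
    return acc;
}
```

In this sketch, passing an `InlineCache<int, 4, 8>` to `sum_neighbors` instead of the raw array leaves the kernel body untouched; only the object handed to it changes. For a sequential access pattern like this stencil, most reads hit in the cache and external memory is touched once per line, which is the kind of behavior a hand-written DMA-like prefetch would otherwise have to provide.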
