Improving HLS Generated Accelerators Through Relaxed Memory Access Scheduling

High-level synthesis (HLS) can be used to generate hardware accelerators for compute-intensive parts of software (so-called kernels). For meaningful acceleration, such kernels should be able to access memory autonomously. Unfortunately, these memory accesses can give rise to dependences (e.g. an array element is written before it is read), which lead to bottlenecks. Analyzing potential conflicts between memory accesses is often difficult and in many cases not possible at all. To improve the scheduling of memory accesses, we propose a novel methodology that fully automatically places bypasses and squashes into the data flow graph used to generate the hardware accelerator. Evaluating our approach with the Powerstone benchmark suite, we show that execution time is reduced by 6.5% on average.
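As an illustration of the kind of memory dependence described above, the following is a minimal C sketch of a kernel whose load/store conflicts cannot be resolved at compile time. The function name `scatter_add` and its parameters are hypothetical; the loop is not taken from the paper or from the Powerstone suite. Whether the load in one iteration depends on the store of an earlier iteration is decided only by the runtime contents of `idx`, so a conservative scheduler would have to serialize all accesses to `data`.

```c
#include <stddef.h>

/* Hypothetical kernel (illustration only, not from the paper).
 * Each iteration loads data[idx[i]] and stores back an incremented value.
 * A later load conflicts with an earlier store only if idx repeats a value,
 * which static analysis generally cannot determine. */
void scatter_add(int *data, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int old = data[idx[i]];  /* load: may read a value stored just before */
        data[idx[i]] = old + 1;  /* store: visible to any later matching load */
    }
}
```

With `idx = {1, 1, 2}` the second iteration must observe the store of the first, while the third iteration is independent of both and could in principle be scheduled earlier.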
