Rapid Memory-Aware Selection of Hardware Accelerators in Programmable SoC Design

Programmable Systems-on-Chips (SoCs) are expected to incorporate a larger number of application-specific hardware accelerators with tightly integrated memories in order to meet the stringent performance and power requirements of embedded systems. Because data sharing between the accelerator memories and the processor is inevitable, application segments must be selected for hardware acceleration such that the communication overhead of data transfers does not negate the advantages of the accelerators. In this paper, we propose a novel memory-aware selection algorithm that uses an iterative approach to rapidly recommend a set of hardware accelerators providing high performance gain under varying area constraints. To significantly reduce the algorithm runtime while still guaranteeing near-optimal solutions, we propose a heuristic that estimates the penalties incurred when the processor accesses the accelerator memories. Each iteration of the proposed algorithm employs a two-pass method: the first pass selects a set of good hardware accelerator candidates using a greedy approach, and the second pass refines the solution using a “sliding window” approach. The two-pass method is performed iteratively on a bounded set of candidate hardware accelerators to limit the search space and to avoid local maxima. To validate the benefits of the proposed selection algorithm, we also develop an exhaustive search algorithm. Experimental results on the popular CHStone benchmark suite show that the performance achieved by the accelerators recommended by the proposed algorithm closely matches that of the exhaustive algorithm, with close to 99% accuracy, while the proposed algorithm runs orders of magnitude faster.
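
The two-pass idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the penalty heuristic, the ranking criterion, or the window rule, so everything here is an assumption made for illustration. In particular, `Candidate`, its fields, the gain-per-area ranking, the fixed per-candidate `penalty`, and the `window` parameter are hypothetical, and the outer iteration over a bounded candidate set is omitted.

```python
from itertools import combinations
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical accelerator candidate (all fields are illustrative)."""
    name: str
    gain: int      # cycles saved when the segment runs on the accelerator
    penalty: int   # assumed estimate of cycles lost to processor accesses
                   # of the accelerator's local memory
    area: int      # area cost of the accelerator

    @property
    def net_gain(self) -> int:
        return self.gain - self.penalty


def total_gain(cands):
    return sum(c.net_gain for c in cands)


def total_area(cands):
    return sum(c.area for c in cands)


def rank(cands):
    # Assumed ranking: net gain per unit area, best first.
    return sorted(cands, key=lambda c: c.net_gain / c.area, reverse=True)


def greedy_pass(cands, budget):
    """First pass: take candidates in rank order while they fit the budget."""
    chosen, used = [], 0
    for c in rank(cands):
        if c.net_gain > 0 and used + c.area <= budget:
            chosen.append(c)
            used += c.area
    return chosen


def sliding_window_pass(cands, chosen, budget, window=3):
    """Second pass: slide a window over the ranked list and re-decide,
    by brute force, which candidates inside the window to keep."""
    ranked = rank(cands)
    best = list(chosen)
    for start in range(max(1, len(ranked) - window + 1)):
        win = ranked[start:start + window]
        fixed = [c for c in best if c not in win]  # selection outside window
        for r in range(len(win) + 1):
            for subset in combinations(win, r):
                trial = fixed + list(subset)
                if (total_area(trial) <= budget
                        and total_gain(trial) > total_gain(best)):
                    best = trial
    return best


def exhaustive(cands, budget):
    """Reference: try every subset (exponential; only for small inputs)."""
    best = []
    for r in range(len(cands) + 1):
        for subset in combinations(cands, r):
            if (total_area(subset) <= budget
                    and total_gain(subset) > total_gain(best)):
                best = list(subset)
    return best


# Made-up data: the greedy pass picks A alone, but B and C together are better.
demo = [Candidate("A", 60, 0, 6), Candidate("B", 50, 5, 5),
        Candidate("C", 45, 5, 5)]
first = greedy_pass(demo, budget=10)
refined = sliding_window_pass(demo, first, budget=10)
print([c.name for c in first], total_gain(first))
print([c.name for c in refined], total_gain(refined))
print(total_gain(exhaustive(demo, budget=10)))
```

With this made-up data and an area budget of 10, the greedy pass stops at a net gain of 60, while the window refinement reaches 85, the same value the exhaustive reference finds. The sketch treats each penalty as independent of the other selected accelerators, which is a simplification; a memory-aware model may need penalties that depend on the whole selection.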
