A HLS-Based Toolflow to Design Next-Generation Heterogeneous Many-Core Platforms with Shared Memory

This work describes how we use High-Level Synthesis (HLS) to support design space exploration (DSE) of heterogeneous many-core systems. Modern embedded systems increasingly couple hardware accelerators and processing cores on the same chip, specializing the platform to an application domain in exchange for increased performance and energy efficiency. However, designing such a platform is complex and error-prone, and requires skills in algorithm design, hardware synthesis, and software engineering. DSE can be partially automated, and thus simplified, by coupling HLS tools with virtual prototyping platforms. In this paper we enable the design space exploration of heterogeneous many-cores that adopt a shared-memory architecture template, where communication and synchronization between the hardware accelerators and the cores happen through L1 shared memory. This communication infrastructure leverages a "zero-copy" scheme, which simplifies both the design of the platform and the development of applications on top of it. Moreover, the shared-memory template fits the semantics of several high-level programming models, such as OpenMP. We provide programmers with simple yet powerful abstractions to exploit accelerators from within an OpenMP application, and propose a low-cost implementation of the necessary runtime support. We set up an HLS-based automatic design flow to quickly explore the design space using a cycle-accurate virtual platform.
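
To make the "zero-copy" idea concrete, the following is a minimal sketch in C with standard OpenMP directives. It is not the abstraction or runtime proposed in this work: the descriptor type `accel_job_t`, the function `hwacc_scale2` (a software stand-in for an HLS-generated accelerator), and the static arrays standing in for L1 shared memory are all hypothetical names introduced for illustration. The point it shows is that the cores pass only pointers to the accelerator, which then reads and writes the shared buffers in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 16

/* Hypothetical job descriptor: only pointers into shared memory are
 * handed to the accelerator, never copies of the data. */
typedef struct {
    const int32_t *in;
    int32_t *out;
    size_t len;
} accel_job_t;

/* Software stand-in for an HLS-generated accelerator (assumption):
 * doubles every element, working directly on the shared buffers. */
static void hwacc_scale2(const accel_job_t *job)
{
    assert(job && job->in && job->out);
    for (size_t i = 0; i < job->len; i++)
        job->out[i] = 2 * job->in[i];
}

/* Plain arrays standing in for the L1 shared memory (assumption). */
static int32_t shared_in[N];
static int32_t shared_out[N];

/* Fills the input on the cores, offloads to the accelerator stand-in,
 * then reduces the accelerator output on the cores. */
int32_t offload_demo(void)
{
    int32_t sum = 0;

    #pragma omp parallel
    {
        /* Cores produce the input in parallel; the worksharing loop
         * ends with an implicit barrier. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            shared_in[i] = i;

        /* One core triggers the accelerator with pointers only; the
         * single construct also ends with an implicit barrier. */
        #pragma omp single
        {
            accel_job_t job = { shared_in, shared_out, N };
            hwacc_scale2(&job);
        }

        /* Cores consume the result straight from shared memory. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += shared_out[i];
    }
    return sum;
}
```

Calling `offload_demo()` from a `main` function exercises the whole sequence. Built without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result. On a real platform the call to `hwacc_scale2` would be replaced by whatever mechanism starts the hardware block, while the buffers would stay where they are.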
