DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures

The recent trend towards tile-based manycore architectures has helped to tackle the memory wall by physically distributing memories and processing nodes. However, this distribution introduces a data-to-task locality challenge, and inter-tile communication often imposes significant software overhead. We therefore previously proposed SHARQ, software-defined hardware-managed queues that enable efficient inter-tile communication through user-defined queues with arbitrarily sized elements. To ensure (remote) processing of queued elements, SHARQ introduces an optional handler task, which is scheduled by hardware on demand. Queue management, intra- and inter-tile data transfer, and handler task invocation are handled entirely by hardware. Only rare tasks, such as dynamic queue creation at run-time, are performed in software. DySHARQ, an extension of SHARQ, enables dynamic and concurrent queue memory management and queue length adjustment in order to adapt to changing application and resource requirements. The DySHARQ hardware monitors the queue memory requirements at run-time and conditionally schedules a software-defined memory management task. It further optimizes the hardware-software interaction for local queue operations. We integrated DySHARQ into the MPI library used by the NAS benchmarks. The evaluation shows a reduction in execution time of up to 43% (compared to software) for the communication-intensive IS kernel in a 4 × 4 tile design on an FPGA platform with a total of 80 LEON3 cores. The dynamic memory management reduces the memory footprint by 3.75× in a 2 × 2 tile design.
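The queue behaviour summarized above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the DySHARQ hardware interface: the class name `SketchQueue`, the `handler` and `grow_hook` callbacks, and the 75% watermark policy are all hypothetical and chosen for illustration. It models a queue with arbitrarily sized elements, an optional handler that is invoked when an element arrives, and a monitoring step that conditionally calls a software-defined memory management task.

```python
from collections import deque


class SketchQueue:
    """Toy software model of a queue with variable-sized elements.

    Hypothetical illustration only; in DySHARQ these steps are done by hardware.
    """

    def __init__(self, capacity_bytes, handler=None, grow_hook=None,
                 high_watermark=0.75):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.elements = deque()
        self.handler = handler          # optional task run on arrival
        self.grow_hook = grow_hook      # optional memory management task
        self.high_watermark = high_watermark  # assumed policy, not from the paper

    def enqueue(self, payload: bytes) -> bool:
        needed = self.used_bytes + len(payload)
        # Monitoring step: conditionally invoke the memory management task.
        if (self.grow_hook is not None
                and needed > self.high_watermark * self.capacity_bytes):
            self.capacity_bytes = self.grow_hook(self.capacity_bytes)
        if needed > self.capacity_bytes:
            return False                # queue memory exhausted
        self.elements.append(payload)
        self.used_bytes = needed
        if self.handler is not None:
            self.handler(self)          # on-demand handler invocation
        return True

    def dequeue(self):
        if not self.elements:
            return None
        payload = self.elements.popleft()
        self.used_bytes -= len(payload)
        return payload


# Usage: the handler drains each element as it arrives; the hook doubles memory.
received = []
q = SketchQueue(16, handler=lambda s: received.append(s.dequeue()),
                grow_hook=lambda cap: cap * 2)
q.enqueue(b"abc")
q.enqueue(b"x" * 20)   # exceeds the watermark, so the hook grows the queue first
```

In this sketch the second enqueue only succeeds because the memory management hook enlarges the queue before the capacity check; without a hook, an oversized element is rejected.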
