Efficient task spawning for shared memory and message passing in many-core architectures

Abstract Modern many-core systems consist of a large number of processing cores and expose ever more parallelism. The partitioned global address space (PGAS) programming model is a popular approach for exploiting this parallelism while offering the flexibility of both the shared memory and message passing paradigms. On the architecture side, networks-on-chip (NoCs) have become an integral part of the communication infrastructure because they scale well. To exploit task-level parallelism on modern many-core architectures, applications spawn more and more tasks onto the available computing resources, and they need low communication and synchronization delays to perform well. However, the distributed nature of NoCs makes it difficult to keep data communication and synchronization latency within the desired bound, which results in higher task spawning overhead. We previously proposed an approach based on hardware-assisted task spawning on many-core systems [1]. In this article, we present an extended version of that work for hardware-managed task spawning that takes into account the communication requirements of both shared memory and message passing programming models. The proposed hardware support, integrated into the network interface architecture, reduces the synchronization overhead of task spawning. Offloading task spawning from software to hardware increases overall performance. Simulation results show that the proposed task spawning approach improves overall performance by up to 40% in comparison to an existing state-of-the-art approach [2]. To demonstrate its applicability, we implemented an FPGA prototype to investigate real-world applications. The investigations show that the proposed concept incurs a low implementation area cost on FPGA and ASIC platforms.

[1] R. Schaller, et al. Moore's law: past, present and future, 1997.

[2] Massimo Ruo Roch, et al. MEDEA: a hybrid shared-memory/message-passing multiprocessor NoC-based architecture, 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[3] Anoop Gupta, et al. The Stanford FLASH Multiprocessor, 1994, ISCA.

[4] Bill Nitzberg, et al. Distributed shared memory: a survey of issues and algorithms, 1991, Computer.

[5] Jürgen Teich, et al. Network Interface with Task Spawning Support for NoC-Based DSM Architectures, 2015, ARCS.

[6] André Schiper, et al. High-Throughput Maps on Message-Passing Manycore Architectures: Partitioning versus Replication, 2014, Euro-Par.

[7] Jason Duell, et al. Productivity and performance using partitioned global address space languages, 2007, PASCO '07.

[8] Donald Yeung, et al. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor, 1991.

[9] Luca Benini, et al. Networks on Chips: A New SoC Paradigm, 2002.

[10] Dimitrios S. Nikolopoulos, et al. On-chip communication and synchronization mechanisms with cache-integrated network interfaces, 2010, Conf. Computing Frontiers.

[11] James Reinders, et al. Intel Xeon Phi Coprocessor High Performance Programming, 2013.

[12] S.K. Reinhardt, et al. Decoupled Hardware Support for Distributed Shared Memory, 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[13] Massoud Pedram, et al. A Novel Synthetic Traffic Pattern for Power/Performance Analysis of Network-on-Chips Using Negative Exponential Distribution, 2009, J. Low Power Electron.

[14] Timothy G. Mattson, et al. Light-weight communications on Intel's single-chip cloud computer processor, 2011, OPSR.

[15] Carl Ramey, et al. TILE-Gx100 ManyCore processor: Acceleration interfaces and architecture, 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[16] Mary K. Vernon, et al. Comparison of hardware and software cache coherence schemes, 1991, ISCA '91.

[17] Jürgen Becker, et al. Providing multiple hard latency and throughput guarantees for packet switching networks on chip, 2013, Comput. Electr. Eng.

[18] Saurabh Dighe, et al. The 48-core SCC Processor: the Programmer's View, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[19] Om Prakash Gangwal, et al. An efficient on-chip NI offering guaranteed services, shared-memory abstraction, and flexible network configuration, 2005.

[20] Shuming Chen, et al. Run-Time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips, 2010, 2010 3rd International Symposium on Parallel Architectures, Algorithms and Programming.

[21] Shuming Chen, et al. Supporting Distributed Shared Memory on multi-core Network-on-Chips using a dual microcoded controller, 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[22] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.

[23] Timothy Mattson, et al. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS, 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).