Asynchronous one-sided communications and synchronizations for a clustered manycore processor

Clustered manycore architectures fitted with a Network-on-Chip (NoC) and scratchpad memories enable highly energy-efficient and time-predictable implementations. However, porting applications to such processors represents a programming challenge. Inspired by supercomputer one-sided communication libraries and by the OpenCL async_work_group_copy primitives, we propose a simple programming layer for communication and synchronization on clustered manycore architectures. We discuss the design and implementation of this layer on the 2nd-generation Kalray MPPA processor, where it is available from both OpenCL and POSIX C/C++ multithreaded programming models. Our measurements show that it allows applications to reach up to 94% of the theoretical hardware throughput, with a best-case round-trip latency of 2.2 μs when operating at 500 MHz.