Asynchronous one-sided communications and synchronizations for a clustered manycore processor

Clustered manycore architectures fitted with a Network-on-Chip (NoC) and scratchpad memories enable highly energy-efficient and time-predictable implementations. However, porting applications to such processors represents a programming challenge. Inspired by supercomputer one-sided communication libraries and by the OpenCL async_work_group_copy primitives, we propose a simple programming layer for communication and synchronization on clustered manycore architectures. We discuss the design and implementation of this layer on the 2nd-generation Kalray MPPA processor, where it is available from both OpenCL and POSIX C/C++ multithreaded programming models. Our measurements show that it allows applications to reach up to 94% of the theoretical hardware throughput, with a best-case round-trip latency of 2.2 μs when operating at 500 MHz.