It is expected that the first exascale supercomputer will be deployed within the next 10 years, but it is still unknown which programming model will allow both easy development and scalable, efficient programs. One programming model considered feasible is the so-called partitioned global address space (PGAS) model, which eases development by providing one common memory address space across all cluster nodes. In this paper we compare remote memory access and memory consistency in current PGAS programming languages and describe how synchronization can generate unneeded network transfers. We furthermore introduce our variation of the PGAS model, which allows for implicit fine-grained pairwise synchronization among the nodes; efficient and easy-to-use synchronization is necessary to keep all the processors of upcoming supercomputers busy. Our model also offers easy deployment of RDMA transfers and uses communication algorithms common in MPI collective operations, but lifts the requirement that the operations be collective. The model is based on single-assignment variables and uses a data-flow-like synchronization mechanism: reading an uninitialized variable blocks the reading thread until the data are made available by another thread, so synchronization happens implicitly when data are read. Broadcast, scatter, and gather are modeled based on data distribution among the nodes, whereas for reduction and scan we follow a combining PRAM approach in which multiple threads write to the same memory location. We discuss both a Gauß-Seidel stencil and bitonic sort in our model. We implemented a proof-of-concept library demonstrating the usability and scalability of the model; with this library the Gauß-Seidel stencil scaled well in initial experiments on an 8-node machine.
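
The blocking-read semantics of single-assignment variables described above can be illustrated with a minimal shared-memory sketch. The class name SingleAssignment and its read/write methods are assumptions made here for illustration, not the actual API of the proof-of-concept library; the sketch only models the data-flow behavior in which a read of an unwritten variable blocks the reading thread until another thread supplies the value.

    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <iostream>

    // Illustrative single-assignment variable: the first write publishes the
    // value; any read issued before that write blocks the reading thread.
    template <typename T>
    class SingleAssignment {
    public:
        // Write once; wakes up all pending reads and satisfies future ones.
        void write(const T& value) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                value_ = value;
                ready_ = true;
            }
            cv_.notify_all();
        }

        // Read blocks until the value has been assigned by another thread.
        T read() {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return ready_; });
            return value_;
        }

    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        T value_{};
        bool ready_ = false;
    };

    int main() {
        SingleAssignment<int> x;

        // The consumer blocks inside read() until the producer assigns x,
        // so synchronization happens implicitly when the data are read.
        std::thread consumer([&] { std::cout << "read: " << x.read() << "\n"; });
        std::thread producer([&] { x.write(42); });

        producer.join();
        consumer.join();
        return 0;
    }

In the distributed setting the same idea applies across nodes: a remote read of an unwritten single-assignment variable is deferred until the owning node makes the data available, which is what allows the model to fold pairwise synchronization into ordinary data transfers.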