Dataflow-like Synchronization in a PGAS Programming Model

The first exascale supercomputer is expected to be deployed within the next 10 years, but it is still not known which programming model will allow both easy development and scalable, efficient programs. One of the programming models considered feasible is the so-called partitioned global address space~(PGAS) model, which eases development by providing one common memory address space across all cluster nodes. In this paper we compare remote memory access and memory consistency in current PGAS programming languages and describe how synchronization can generate unneeded network transfers. We furthermore introduce our variation of the PGAS model, which allows for implicit fine-grained pair-wise synchronization among the nodes. Efficient and easy-to-use synchronization is necessary to keep all the processors of upcoming supercomputers busy. We also offer easy deployment of RDMA transfers and use communication algorithms commonly used in MPI collective operations, but lift the requirement that the operations be collective. Our model is based on single-assignment variables and uses a dataflow-like synchronization mechanism. Reading an uninitialized variable blocks the reading thread until the data are made available by another thread. That way, synchronization is done implicitly when data are read. Broadcast, scatter, and gather are modeled based on data distribution among the nodes, whereas for reduction and scan we follow a combining PRAM approach in which multiple threads write to the same memory location. We discuss both a Gauß-Seidel stencil and bitonic sort in our model. We implemented a proof-of-concept library showing the usability and scalability of the model. With this library the Gauß-Seidel stencil scaled well in initial experiments on an 8-node machine.
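
The blocking-read behavior of single-assignment variables can be illustrated with a minimal sketch. It is not the proof-of-concept library described here: threads within one Python process stand in for cluster nodes, no network transfer takes place, and the names `SingleAssignment`, `write`, `read`, and `demo` are hypothetical, chosen only for this illustration.

```python
import threading


class SingleAssignment:
    """Hypothetical write-once cell: reads block until a value is written."""

    def __init__(self):
        self._value = None
        self._ready = threading.Event()  # set once the value is available
        self._lock = threading.Lock()    # guards the write-once check

    def write(self, value):
        # A second write violates single assignment and is rejected.
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("single-assignment variable already written")
            self._value = value
            self._ready.set()            # wakes all blocked readers

    def read(self):
        # Implicit synchronization: the reader blocks here until the
        # producer has written the value.
        self._ready.wait()
        return self._value


def demo():
    cell = SingleAssignment()
    results = []

    def consumer():
        # May start before the producer writes; read() then blocks.
        results.append(cell.read() * 2)

    t = threading.Thread(target=consumer)
    t.start()
    cell.write(21)                       # producer makes the data available
    t.join()
    return results
```

In this sketch the consumer needs no explicit barrier or lock acquisition of its own; the read itself is the synchronization point, which is the property the model relies on.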