ARMCI: A Portable Aggregate Remote Memory Copy Interface

Motivation and Background

A portable lightweight remote memory copy is needed in parallel distributed-array libraries and compiler runtime systems. However, a simple API for transfers of contiguous blocks of data in the style of the local memory copy operation memcpy(addr1, addr2, nbytes) is not optimal for this purpose. In particular, such an API can lead to poor performance on systems with high-latency networks for applications that require noncontiguous data transfers (for example, sections of dense multidimensional arrays or scatter/gather operations). In most cases, the performance loss is due to the communication subsystem handling each contiguous portion of the data as a separate message, so the communication startup cost is incurred once per portion rather than once per transfer. The problem can be attributed to an inadequate API that does not pass information about the intended data transfer and the actual layout of the user data to the communication subsystem. With a more descriptive communication interface, a communication library usually has many ways to optimize performance, for example: 1) minimize the number of underlying network packets by packing distinct blocks of data into as few packets as possible, 2) minimize the number of interrupts in interrupt-driven message-delivery systems, and 3) take advantage of any available shared memory optimizations (prefetching/poststoring) on shared memory systems. In principle, when shared memory is used, remote copy operations should map directly, without intermediate copying of the data, to the native high-performance memory copy operations (including bulk data transfer facilities).
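The startup-cost argument can be illustrated with a small local sketch in C. It is not the ARMCI interface: put_contig, put_strided2d, copy_section, and the g_startups counter are hypothetical names introduced here, and memcpy on local arrays stands in for a network transfer. The sketch assumes only that each call into the communication layer costs one startup, and compares a memcpy-style interface with one that receives the full layout of a two-dimensional array section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { N = 8, ROWS = 4, COLS = 3 };   /* array size and section shape (illustrative) */

/* Hypothetical counter standing in for communication startups (messages). */
static int g_startups = 0;

/* Hypothetical contiguous-only transfer in the style of memcpy:
 * every call is treated as one message, hence one startup. */
static void put_contig(void *dst, const void *src, size_t nbytes)
{
    g_startups++;
    memcpy(dst, src, nbytes);
}

/* Hypothetical descriptive transfer: the caller passes the whole 2-D layout
 * (row length, row count, strides), so the layer pays a single startup and
 * is free to pack or pipeline the rows internally. */
static void put_strided2d(void *dst, size_t dst_stride,
                          const void *src, size_t src_stride,
                          size_t row_bytes, size_t nrows)
{
    assert(row_bytes <= src_stride && row_bytes <= dst_stride);
    g_startups++;
    for (size_t r = 0; r < nrows; r++)
        memcpy((char *)dst + r * dst_stride,
               (const char *)src + r * src_stride, row_bytes);
}

/* Copy the ROWS x COLS section whose corner is element (2,1) from src to the
 * same position in dst and return the number of startups that were charged. */
static int copy_section(double dst[N][N], double src[N][N], int descriptive)
{
    g_startups = 0;
    if (descriptive) {
        put_strided2d(&dst[2][1], N * sizeof(double),
                      &src[2][1], N * sizeof(double),
                      COLS * sizeof(double), ROWS);
    } else {
        for (int r = 0; r < ROWS; r++)   /* one call, one message, per row */
            put_contig(&dst[2 + r][1], &src[2 + r][1], COLS * sizeof(double));
    }
    return g_startups;
}
```

Under these assumptions, copy_section charges four startups for the four-row section when it uses the contiguous-only call and one startup when the layout is passed in a single call. The data copied is identical in both cases; only the amount of information available to the communication layer differs.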