Exchanging multiple messages via MPI

In a series of benchmark codes we demonstrate that sending multiple messages is most efficiently done using non-blocking communication; it is irrelevant whether these messages go to the same processing node or to different ones. Non-blocking communication allows the latencies of the individual communication calls to be overlapped and better utilises the dual-plane design of the HPCx switch network. These findings can be applied to, for example, halo exchanges in domain decomposition codes. We demonstrate that further modest performance gains may be achieved by overlapping communication with calculation.
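The pattern described above can be sketched as follows. This is a minimal illustrative example of a non-blocking halo exchange in a one-dimensional periodic decomposition, not the benchmark code itself; the halo width and the data values are assumed for illustration. All receives and sends are posted before any completion call, so the latencies of the individual messages can overlap, and independent (interior) computation can proceed before `MPI_Waitall`.

```c
/* Sketch of a non-blocking halo exchange between left/right neighbours
 * in a 1-D periodic domain decomposition. Illustrative only; compile
 * with mpicc and run under mpirun. */
#include <mpi.h>
#include <stdio.h>

#define HALO 4  /* halo width: an assumed value for this sketch */

int main(int argc, char **argv)
{
    int rank, size;
    double send_left[HALO], send_right[HALO];
    double recv_left[HALO], recv_right[HALO];
    MPI_Request reqs[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Periodic neighbours in a 1-D decomposition. */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    for (int i = 0; i < HALO; i++) {
        send_left[i]  = (double)rank;  /* boundary data for the left neighbour  */
        send_right[i] = (double)rank;  /* boundary data for the right neighbour */
    }

    /* Post all receives and sends at once so their latencies overlap,
     * rather than issuing blocking calls one after another. */
    MPI_Irecv(recv_left,  HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_right, HALO, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_right, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send_left,  HALO, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);

    /* Interior computation that does not touch the halo regions could
     * be performed here, overlapping communication and calculation. */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received halos from ranks %d and %d\n",
           rank, (int)recv_left[0], (int)recv_right[0]);

    MPI_Finalize();
    return 0;
}
```

Posting the receives before the matching sends lets the MPI library deliver incoming data directly into the user buffers; the computation between the posts and `MPI_Waitall` is where the modest gains from communication/calculation overlap would be realised.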