BQCD with GPI: A case study

We compare the performance of BQCD, a typical high-performance computing application, when using either MPI or the Fraunhofer GPI communication library. Our analysis focuses on the performance-critical part of BQCD, which accounts for 50 percent of the total program run-time: the computation of a four-dimensional nearest-neighbor stencil operator in a domain-decomposed simulation volume. In this respect, BQCD is a typical representative of the broad class of stencil algorithms. To obtain optimal speedup, we overlap communication with computation and analyse the resulting run-times on two test systems. We introduce the overlap efficiency as a measure of the communication library's ability to overlap communication with computation. In the regime in which the raw communication latency is less than the raw computation time, the overlap efficiency should be equal to one. This regime depends on the problem size and on the number of cores used. Deviations from one indicate interference between communication and computation induced by the communication library, side effects that impair scalability in practice. We find that GPI has an overlap efficiency equal to one, i.e. it allows for perfect overlap and ideal scalability: the total run-time is equal to the time spent on the pure computation. For the same communication pattern, MPI has an overlap efficiency of less than one; it cannot hide the communication completely, which in general results in worse scalability. The GPI speedups relative to the equivalent MPI implementation are of the order of 20-30 percent.
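The abstract does not give a formula for the overlap efficiency, so the following is only a minimal sketch under an assumed definition: the ideal overlapped run-time, taken as the larger of the raw computation and raw communication times, divided by the measured run-time with overlap. The function name, its parameters, and the timings in the usage lines are hypothetical and are not taken from the paper. Under this assumption the value is one exactly when the measured run-time equals the pure computation time in the regime where communication is faster than computation.

```python
def overlap_efficiency(t_comp: float, t_comm: float, t_measured: float) -> float:
    """Assumed definition (not from the paper): ideal overlapped time / measured time.

    t_comp     -- raw computation time without any communication
    t_comm     -- raw communication time without any computation
    t_measured -- measured run-time with communication and computation overlapped
    """
    if min(t_comp, t_comm) < 0 or t_measured <= 0:
        raise ValueError("timings must be non-negative and t_measured positive")
    # With perfect overlap, the slower of the two activities hides the other.
    t_ideal = max(t_comp, t_comm)
    return t_ideal / t_measured


# Illustrative timings only (arbitrary units, not measured data).
perfect = overlap_efficiency(t_comp=10.0, t_comm=4.0, t_measured=10.0)
partial = overlap_efficiency(t_comp=10.0, t_comm=4.0, t_measured=12.5)
print(perfect, partial)
```

In this sketch, a value below one means that part of the communication time remained visible in the measured run-time, which is the kind of deviation the text attributes to interference induced by the communication library.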