Latency, bandwidth, and concurrent issue limitations in high-performance CFD.

To achieve high performance, a parallel algorithm must use the memory subsystem effectively and minimize both the communication volume and the number of network transactions. These issues gain further importance on modern architectures, where peak CPU performance is increasing much more rapidly than memory or network performance. In this paper, we present some performance-enhancing techniques that were applied to an unstructured-mesh implicit solver. Our experimental results show that this solver adapts reasonably well to high memory and network latencies.