Latency hiding in parallel systems: a quantitative approach

In many parallel applications, network latency causes a dramatic loss in processor utilization. This paper examines software pipelining as a technique for hiding network latency and quantifies the potential improvements with detailed, instruction-level simulations. The benchmarks are the Livermore Loop kernels and the BLAS Level 1 routines. These were parallelized and run on the instruction-level RISC simulator DLX, extended with both a blocking and a pipelined network. Our results show that prefetching over a pipelined network improves performance by a factor of 2 to 9, provided the network has sufficient bandwidth to accept at least 10 outstanding requests per processor.
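The contrast between a blocking and a pipelined network can be illustrated with a toy timing model. The sketch below is not the DLX simulator used in the paper. It assumes a loop in which every iteration needs one remote operand, and the function names and the parameters `latency`, `compute`, and `depth` (the number of outstanding requests the network accepts per processor) are hypothetical, chosen only for illustration.

```python
def blocking_cycles(n, latency, compute):
    """Cycles for n iterations when each remote load stalls the processor."""
    return n * (latency + compute)


def pipelined_cycles(n, latency, compute, depth):
    """Cycles for n iterations with software-pipelined prefetching.

    Assumption: up to `depth` requests may be in flight, and issuing a
    request costs no cycles. The operand for iteration i + depth is
    requested as soon as iteration i has received its own operand.
    """
    ready = [0] * n                      # arrival time of each operand
    for i in range(min(depth, n)):       # prologue: fill the pipeline
        ready[i] = latency
    t = 0
    for i in range(n):
        t = max(t, ready[i])             # stall only if the operand is late
        nxt = i + depth
        if nxt < n:
            ready[nxt] = t + latency     # a slot is free: prefetch ahead
        t += compute                     # do the work of iteration i
    return t


if __name__ == "__main__":
    n, latency, compute = 100, 100, 10   # illustrative values, not measured
    base = blocking_cycles(n, latency, compute)
    for depth in (1, 2, 5, 10):
        cycles = pipelined_cycles(n, latency, compute, depth)
        print(depth, cycles, round(base / cycles, 2))
```

In this simplified model the processor stops stalling once `depth * compute` reaches `latency`, which is one way to see why the achievable gain depends on how many requests the network can accept per processor. The numbers it prints are properties of the toy model only and are not the paper's measurements.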