Mechanisms for hiding communication latency in data parallel architectures

The goal of this dissertation is to explore techniques for improving the performance of interprocessor communication on data parallel architectures. In many cases, interprocessor communication latency accounts for a significant fraction of the total execution time of data parallel applications. The lockstep execution model of traditional SIMD machines precludes exploiting task parallelism. We demonstrate that a shift in the SIMD paradigm that permits a small degree of task parallelism can reduce communication overhead independently of any improvements in technology. In this work, we identify two primary mechanisms for exploiting communication concurrency in data parallel applications: overlapping communication with computation, and overlapping communication with other communication. We propose an architectural framework, referred to as concurrently communicating SIMD (CCSIMD), that exploits communication concurrency in data parallel applications, study three specific implementations of CCSIMD, and evaluate their impact on a suite of data parallel applications. The results show that exploiting communication concurrency can yield significant performance improvements. For well-balanced architectures, overlapping communication with computation offers the most benefit; in architectures with relatively stronger computational support, overlapping communication with other communication performs better. Combining the two techniques often outperforms either technique alone.
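
As a concrete illustration of the first mechanism, the sketch below overlaps a boundary exchange with independent computation using nonblocking message passing. MPI is used here only as a familiar vehicle for the pattern and is an assumption of this sketch, not the dissertation's mechanism; the CCSIMD architectures realize the same overlap at the architectural level.

/* Minimal sketch: overlapping communication with computation via
 * nonblocking message passing. MPI is illustrative only; CCSIMD
 * provides this overlap in the architecture itself. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, nprocs;
    double halo_send[N], halo_recv[N], interior[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;          /* ring neighbors */
    int left  = (rank - 1 + nprocs) % nprocs;

    for (int i = 0; i < N; i++) {
        halo_send[i] = rank;
        interior[i]  = i;
    }

    /* Start the boundary exchange, but do not wait for it yet. */
    MPI_Request reqs[2];
    MPI_Isend(halo_send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(halo_recv, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* Compute on interior data while the transfer is in flight;
     * this work hides (part of) the communication latency. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += interior[i] * interior[i];

    /* Block only now, when the halo data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Boundary computation that depends on the received halo. */
    sum += halo_recv[0];
    printf("rank %d: sum = %f\n", rank, sum);

    MPI_Finalize();
    return 0;
}

The same structure extends to the second mechanism: issuing several independent transfers before waiting on any of them lets their latencies overlap one another rather than accumulate serially.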