Synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms
Distributed-memory multiprocessors have shown great promise as cost-effective machines for scalable high-performance supercomputing. However, developing efficient parallel programs using message passing remains a difficult and error-prone task. In this thesis, we present a framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms, such as the fast Fourier transform (FFT) and Strassen's matrix multiplication. This framework is based on an explicitly parallel algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations. This representation is useful in analyzing the communication implications of computation partitioning and data distribution.
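As a small numerical illustration (not taken from the thesis itself), the standard Cooley-Tukey factorization expresses a DFT matrix as a product of tensor (Kronecker) products, twiddle-factor diagonals, and stride permutations; the sketch below checks this identity for a 4-point transform:

```python
import numpy as np

# Illustrative check of the tensor-product form of the 4-point FFT:
#   F_4 = (F_2 kron I_2) . T . (I_2 kron F_2) . L
# where T holds the twiddle factors and L is the stride-2 permutation.
F2 = np.array([[1, 1], [1, -1]], dtype=complex)
I2 = np.eye(2)
w = np.exp(-2j * np.pi / 4)        # principal 4th root of unity
T = np.diag([1, 1, 1, w])          # twiddle-factor diagonal
L = np.eye(4)[[0, 2, 1, 3]]        # stride permutation: even indices first
F4 = np.kron(F2, I2) @ T @ np.kron(I2, F2) @ L

# Compare against the DFT matrix built directly from its definition
j, k = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
F4_direct = w ** (j * k)
assert np.allclose(F4, F4_direct)
```

Each factor in such a product has a direct parallel interpretation: a tensor product with an identity matrix describes independent local computations, while a stride permutation describes data movement, which is what makes this representation useful for communication analysis.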
In this framework, programs are synthesized under two different target program models. These two models are based on different ways of managing the distribution of data for optimizing communication. In the first model, the distributions of arrays are kept static, so communication is needed whenever a processor requires non-local data elements. In the second model, the distributions of the data arrays are dynamically changed to ensure that computation is localized in every computation step. The two models yield programs with different communication overhead characteristics because they use different communication primitives for data movement: the first model uses point-to-point interprocessor communication primitives, whereas the second model uses data redistribution primitives involving collective all-to-many communication. These two program models are shown to be suitable for different problem-size ranges; the programs with redistributions have lower communication overhead than programs using a static distribution when the problem size is large.
In order to achieve high performance on distributed-memory machines, programs should be tailored to the machine's architecture, such as its network topology. In the programs generated under the redistribution model, communication is optimized by using collective communication primitives tuned to the target machine's topology. The impact of communication overhead is further reduced by overlapping computation with communication. We present an overlapping technique which is suitable for distributed-memory machines with a wormhole-routed or circuit-switched mesh or hypercube interconnection network.
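The overlap idea can be sketched in a few lines of plain Python (a simulation under assumed names, not the thesis's actual message-passing code): while the current block is being processed, the transfer of the next block proceeds in the background, hiding communication latency behind computation.

```python
import threading

def pipelined(blocks, comm, compute):
    """Process blocks, overlapping the 'communication' (comm) for
    block k+1 with the computation on block k. Illustrative only:
    comm stands in for a message-passing receive, compute for the
    local work of one step."""
    results = []
    current = comm(blocks[0])            # blocking fetch of the first block
    for k in range(len(blocks)):
        prefetch = {}
        t = None
        if k + 1 < len(blocks):
            # fetch the next block in a background thread
            t = threading.Thread(
                target=lambda nb=blocks[k + 1]: prefetch.update(data=comm(nb)))
            t.start()
        results.append(compute(current)) # overlaps with the prefetch above
        if t is not None:
            t.join()
            current = prefetch["data"]
    return results
```

The same structure applies on a wormhole-routed or circuit-switched network with non-blocking sends and receives in place of the background thread; the result must match a purely sequential fetch-then-compute loop, only faster.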
The methodology is illustrated by synthesizing communication-efficient programs for FFT algorithms. We have incorporated this framework into the EXTENT portable parallel programming environment.