Performance Modeling of Distributed Memory Architectures

We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We show how the processor configuration, the data aggregation, and the encoding of the address space affect performance for two important basic computations: the multiplication of arbitrarily shaped matrices and the Fast Fourier Transform. We also give an example of the performance behavior of local matrix operations on a processor with a single path to local memory and a set of processor registers. The analytic models are verified by measurements on the Connection Machine Model CM-2.
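
To make the single-source broadcast primitive concrete, the following is a minimal sketch of a one-to-all broadcast on an n-dimensional Boolean cube, in which every node that already holds the data forwards it across one cube dimension per step. The function names (`broadcast_schedule`, `model_time`) and the linear cost model (a fixed start-up cost plus a per-byte transfer cost for each step) are illustrative assumptions, not the models derived in the paper.

```python
def broadcast_schedule(n, source=0):
    """Return, for each step, the (sender, receiver) pairs of a one-to-all
    broadcast on an n-dimensional Boolean cube with 2**n nodes.

    In step d, every node that already holds the data sends it to its
    neighbor across dimension d (the address differing in bit d).
    """
    have = {source}
    steps = []
    for d in range(n):
        pairs = [(p, p ^ (1 << d)) for p in sorted(have)]
        steps.append(pairs)
        have.update(receiver for _, receiver in pairs)
    return steps


def model_time(n, message_bytes, startup, per_byte):
    """Estimated broadcast time under an ASSUMED linear cost model:
    each of the n steps costs startup + message_bytes * per_byte."""
    return n * (startup + message_bytes * per_byte)


if __name__ == "__main__":
    for step, pairs in enumerate(broadcast_schedule(3)):
        print(f"step {step}: {pairs}")
    print("estimated time:", model_time(3, 100, 10.0, 0.5))
```

The number of informed nodes doubles in every step, so all 2**n nodes are reached after n steps. This simple schedule uses only one link per sending node in each step; schemes that use several cube links concurrently would need a different cost expression.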
