Network-related performance issues and techniques for MPPs

In this paper we review network-related performance issues for current Massively Parallel Processors (MPPs) in the context of some important basic operations in scientific and engineering computation. The communication system is one of the most performance-critical architectural components of MPPs. In particular, understanding the demands posed by collective communication is critical to architectural design and system software implementation. We discuss collective communication and some techniques for implementing it on electronic networks. Finally, we give an example of a novel general routing technique that exhibits good scalability, efficiency, and simplicity in electronic networks.
