Merrimac: Supercomputing with Streams

Merrimac uses a stream architecture and advanced interconnection networks to deliver an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality with a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed memory bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
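
The arithmetic behind these claims can be checked with a short back-of-the-envelope sketch. The function names and the simple bandwidth-bound model below are illustrative assumptions, not part of the Merrimac design. The only figures taken from the abstract are the node count, the performance levels, and the two price points; the memory bandwidth and words-per-FLOP values are hypothetical placeholders chosen to show the scaling.

```python
def sustainable_flops(mem_bandwidth_words_per_s, mem_words_per_flop):
    """Arithmetic rate a node can sustain when memory bandwidth is the limit.

    Assumed model: each FLOP needs a fixed average number of memory words,
    so the sustainable rate is bandwidth divided by that demand.
    """
    return mem_bandwidth_words_per_s / mem_words_per_flop


def per_node_flops(system_flops, nodes):
    """Average arithmetic rate each node must deliver."""
    return system_flops / nodes


def dollars_per_gflops(cost_dollars, system_flops):
    """Cost per GFLOPS of peak performance."""
    return cost_dollars / (system_flops / 1e9)


# Hypothetical node: fixed memory bandwidth, two levels of memory demand.
bandwidth = 2.5e9        # words/s (placeholder value, not from the source)
demand_cluster = 1.0     # memory words per FLOP without stream locality (assumed)
demand_stream = 0.1      # 10x lower demand, as the abstract claims is achievable

gain = (sustainable_flops(bandwidth, demand_stream)
        / sustainable_flops(bandwidth, demand_cluster))

# Figures quoted in the abstract.
node_rate = per_node_flops(1e15, 8192)            # 1 PFLOPS across 8,192 nodes
workstation_cost = dollars_per_gflops(20e3, 2e12)  # $20K, 2 TFLOPS
supercomputer_cost = dollars_per_gflops(20e6, 2e15)  # $20M, 2 PFLOPS

print(f"arithmetic supported per unit bandwidth: {gain:.1f}x")
print(f"per-node rate for 1 PFLOPS on 8,192 nodes: {node_rate / 1e9:.1f} GFLOPS")
print(f"workstation: ${workstation_cost:.2f}/GFLOPS, "
      f"supercomputer: ${supercomputer_cost:.2f}/GFLOPS")
```

Under this model, cutting memory demand per FLOP by a factor of ten lets the same bandwidth feed ten times the arithmetic, and the two quoted price points work out to the same cost per GFLOPS at both ends of the scaling range.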
