Low-level router design and its impact on supercomputer system performance

Supercomputer performance depends heavily on the design of its interconnection subsystem. In this paper we study how different architectural approaches to router design affect system performance when running real parallel applications. A thorough methodology has been employed to quantify this impact. First, architectural router decisions were made taking into account the constraints of the underlying VLSI technology. Next, an exhaustive evaluation of the interconnection network under standard synthetic traffic was carried out. Finally, an execution-driven simulation environment was used to assess the consequences of several router designs on the performance of the entire machine. We show that low-level decisions, such as the adequate selection of the router's arbiter, significantly reduce the execution time of parallel applications. To illustrate the effects of the router architecture on system performance, two benchmarks were selected: Radix and MP3D.
