A case study in using massively parallel simulation for extreme-scale torus network codesign

A high-bandwidth, low-latency interconnect will be a critical component of future exascale systems. The torus network topology, which uses multidimensional network links to improve path diversity and exploit locality between nodes, is a potential candidate for exascale interconnects. The communication behavior of large-scale scientific applications running on future exascale networks is particularly important, and it cannot be deduced from analytical or algorithmic models alone. Therefore, before such systems are built, it is important to explore the design space and performance of candidate exascale interconnects through simulation. We improve upon previous work in this area and present a methodology for modeling and simulating a high-fidelity, validated, and scalable torus network topology at packet-chunk-level detail using the Rensselaer Optimistic Simulation System (ROSS). We execute various configurations of a 1.3 million node torus network model to examine the effect of torus dimensionality on network performance under relevant HPC traffic patterns. To the best of our knowledge, these are the largest torus network simulations carried out at this level of fidelity. In terms of simulation performance, a 1.3 million node, 9-D torus network model processes a simulated exascale-class workload of nearest-neighbor traffic, with 100 million message injections per second per node, on 65,536 Blue Gene/Q cores in a simulation run time of only 25 seconds. We also demonstrate that massive-scale simulations are a critical tool in exascale system design, since small-scale torus simulations are not always indicative of network behavior at exascale size. The take-away message from this case study is that massively parallel simulation is a key enabler for effective extreme-scale network codesign.
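To make the role of torus dimensionality concrete, the following is a minimal sketch of how neighbors and minimal hop counts follow from wraparound links in an n-dimensional torus. It is not the ROSS model described above; the function names and the dimension sizes in the usage lines are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def torus_neighbors(coord: Sequence[int], dims: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return the distinct neighbors of `coord` in a torus with side lengths `dims`."""
    origin = tuple(coord)
    neighbors: List[Tuple[int, ...]] = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(origin)
            nxt[axis] = (origin[axis] + step) % size  # wraparound link
            cand = tuple(nxt)
            # Skip self-links (size 1) and duplicate links (size 2).
            if cand != origin and cand not in neighbors:
                neighbors.append(cand)
    return neighbors


def torus_hops(src: Sequence[int], dst: Sequence[int], dims: Sequence[int]) -> int:
    """Minimal hop count, taking the shorter way around the ring in each dimension."""
    return sum(min((d - s) % n, (s - d) % n) for s, d, n in zip(src, dst, dims))


def torus_diameter(dims: Sequence[int]) -> int:
    """Largest minimal hop count between any two nodes."""
    return sum(n // 2 for n in dims)


# Hypothetical shapes with the same node count (512) but different dimensionality.
print(torus_diameter((8, 8, 8)))   # 3-D torus -> 12
print(torus_diameter((2,) * 9))    # 9-D torus -> 9
```

For a fixed node count, raising the dimensionality shortens the worst-case path at the cost of more links per node. Contention under real traffic is not captured by such hop counts, which is why simulation is needed to evaluate it.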
