FatTreeSim: Modeling Large-scale Fat-Tree Networks for HPC Systems and Data Centers Using Parallel and Discrete Event Simulation

Fat-tree topologies have been widely adopted as the communication network in data centers over the past decade. High-performance computing (HPC) system designers are now considering fat-tree as the interconnection network for next-generation supercomputers. For extreme-scale computing systems such as data centers and supercomputers, performance depends heavily on the interconnection network. In this paper, we present FatTreeSim, a toolkit based on parallel discrete event simulation (PDES) that consists of a highly scalable fat-tree network model. Its goals are to better understand the design constraints of fat-tree networking architectures in data centers and HPC systems and to evaluate the applications running on top of the network. FatTreeSim is designed to model and simulate large-scale fat-tree networks of up to millions of nodes with protocol-level fidelity. We have conducted extensive experiments to validate and demonstrate the accuracy, scalability, and usability of FatTreeSim. On Mira, the Blue Gene/Q system at the Argonne Leadership Computing Facility, FatTreeSim achieves a peak event rate of 305 M/s for a 524,288-node fat-tree model with a total of 567 billion committed events. The strong-scaling experiments use up to 32,768 cores and show near-linear scalability. Compared with a small-scale physical system in Emulab, FatTreeSim accurately models the latency of the same fat-tree network, with an error rate of less than 10% in most cases. Finally, we demonstrate FatTreeSim's usability through a case study in which FatTreeSim serves as the network module of the YARNsim system; the error rates for all test cases are less than 13.7%.
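
To give a sense of the scale involved, the following minimal sketch computes the component counts of a three-level k-ary fat-tree built from identical k-port switches, the construction commonly used for commodity data center networks. The function name `fat_tree_size` is hypothetical and is not part of FatTreeSim. The abstract does not state which switch radix was simulated; the observation that k = 128 yields exactly 524,288 hosts is an arithmetic coincidence worth noting, not a confirmed description of the experimental configuration.

```python
def fat_tree_size(k: int) -> dict:
    """Component counts for a three-level k-ary fat-tree (illustrative only).

    Assumes the standard construction from k-port switches: k pods, each
    with k/2 edge and k/2 aggregation switches, (k/2)^2 core switches,
    and k/2 hosts attached to every edge switch.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "hosts": k * half * half,  # equals k^3 / 4
    }


if __name__ == "__main__":
    for k in (4, 48, 128):
        print(k, fat_tree_size(k))
```

Under these assumptions, `fat_tree_size(128)["hosts"]` evaluates to 524,288, which matches the node count quoted above, and radices only slightly larger than that already give networks with millions of hosts.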
