Watch Out for the Bully! Job Interference Study on Dragonfly Network

High-radix, low-diameter dragonfly networks will be a common choice in next-generation supercomputers. Preliminary studies suggest that random job placement with adaptive routing should be the rule of thumb for such networks, because it distributes traffic uniformly and alleviates congestion. In this work, however, we find that although random job placement coupled with adaptive routing balances network traffic well, it cannot guarantee the best performance for every job: the performance improvement of communication-intensive applications comes at the expense of less intensive ones. We identify this "bully" behavior and validate its underlying causes using detailed network simulation and real application traces. We further investigate a hybrid contiguous-noncontiguous job placement policy as an alternative. Initial experiments show that hybrid job placement reduces the worst-case performance degradation of less communication-intensive applications while retaining the performance of communication-intensive ones.
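
The abstract does not spell out how the hybrid policy assigns nodes, so the following is only a minimal sketch of one plausible reading: less communication-intensive jobs receive contiguous node ranges, and communication-intensive jobs receive randomly chosen nodes from what remains. The function name `place_jobs`, the job tuples, the `comm_intensive` flag, and the flat node numbering are all assumptions made for illustration; they are not taken from the paper.

```python
import random


def place_jobs(jobs, num_nodes, seed=0):
    """Hypothetical hybrid placement sketch (not the authors' implementation).

    jobs: list of (name, size, comm_intensive) tuples.
    Returns a dict mapping each job name to a sorted list of node ids.
    """
    if sum(size for _, size, _ in jobs) > num_nodes:
        raise ValueError("jobs do not fit on the machine")

    rng = random.Random(seed)
    placement = {}
    next_free = 0

    # Contiguous part: less intensive jobs get consecutive node ids.
    for name, size, comm_intensive in jobs:
        if not comm_intensive:
            placement[name] = list(range(next_free, next_free + size))
            next_free += size

    # Noncontiguous part: intensive jobs draw random nodes from the rest.
    remaining = list(range(next_free, num_nodes))
    rng.shuffle(remaining)
    for name, size, comm_intensive in jobs:
        if comm_intensive:
            placement[name] = sorted(remaining[:size])
            remaining = remaining[size:]

    return placement


# Example with made-up job names and sizes.
example = place_jobs(
    [("light_a", 8, False), ("heavy_a", 16, True), ("light_b", 8, False)],
    num_nodes=64,
)
```

In this sketch the two less intensive jobs occupy nodes 0-7 and 8-15, while the intensive job is scattered over the remaining 48 nodes. A real dragonfly allocator would also need to account for routers and groups, which this flat numbering ignores.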
