Mitigating network noise on Dragonfly networks through application-aware routing

System noise can degrade the performance of HPC systems, and the interconnection network is one of the main contributors to this problem. To mitigate network-induced noise, adaptive routing sends packets along non-minimal paths when those paths are less congested. However, while this may reduce interference caused by congestion, it also generates more traffic, because packets traverse additional hops; this extra traffic in turn causes congestion for other applications and for the application itself. In this paper, we first describe how to estimate network noise. Following these guidelines, we show that noise can be reduced by using routing algorithms that select minimal paths with a higher probability. We exploit this insight to design an algorithm that changes the probability of selecting minimal paths according to the characteristics of the application. We validate our solution with microbenchmarks and real-world applications on two systems based on a Dragonfly interconnection network, showing both reduced noise and improved performance.
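The trade-off between avoiding congestion and adding hops can be illustrated with a UGAL-style (Universal Globally-Adaptive Load-balanced) routing decision, a classic adaptive scheme for Dragonfly networks. The Python sketch below is hypothetical and does not reproduce the routing logic of any particular system or the algorithm of this paper: the queue-occupancy inputs, the hop counts, and the `minimal_bias` parameter are illustrative assumptions. It estimates each path's delay as queue occupancy times hop count and takes the non-minimal path only when the minimal one looks slower by more than the bias, so raising the bias increases the probability of routing minimally.

```python
import random
from dataclasses import dataclass

# Illustrative hop counts for a canonical Dragonfly (assumption): a minimal
# path is at most local-global-local (3 router-to-router hops), while a
# Valiant path through an intermediate group is at most
# local-global-local-global-local (5 hops).
MIN_HOPS = 3
NONMIN_HOPS = 5


@dataclass
class PathState:
    queue_occupancy: float  # queued traffic at the first output port (assumed metric)
    hops: int               # number of router-to-router hops on the path


def route_minimally(minimal: PathState, nonminimal: PathState, minimal_bias: float) -> bool:
    """Return True if a UGAL-style rule would send the packet on the minimal path.

    Estimated path delay = queue occupancy * hop count. The minimal path is
    kept unless it is estimated to be slower than the non-minimal path by
    more than `minimal_bias` (hypothetical knob: larger values favour
    minimal paths and therefore generate less extra traffic).
    """
    minimal_cost = minimal.queue_occupancy * minimal.hops
    nonminimal_cost = nonminimal.queue_occupancy * nonminimal.hops
    return minimal_cost <= nonminimal_cost + minimal_bias


def minimal_fraction(minimal_bias: float, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of packets routed minimally under synthetic, uniformly random queue states."""
    rng = random.Random(seed)  # fixed seed: every bias value sees the same queue states
    chosen = 0
    for _ in range(trials):
        minimal = PathState(rng.uniform(0.0, 10.0), MIN_HOPS)
        nonminimal = PathState(rng.uniform(0.0, 10.0), NONMIN_HOPS)
        if route_minimally(minimal, nonminimal, minimal_bias):
            chosen += 1
    return chosen / trials


if __name__ == "__main__":
    # Sweep the bias to show how it shifts traffic toward minimal paths.
    for bias in (0.0, 10.0, 25.0, 50.0):
        print(f"bias={bias:5.1f}  minimal fraction={minimal_fraction(bias):.3f}")
```

Running the script prints the fraction of minimally routed packets for increasing bias values under random queue states. In the spirit of the approach summarized above, an application-aware policy could select the bias per application rather than fixing one value system-wide; how that choice is made is not modeled here.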
