Measuring Congestion in High-Performance Datacenter Interconnects

While it is widely acknowledged that network congestion in High-Performance Computing (HPC) systems can significantly degrade application performance, congestion on credit-based interconnect networks has received little quantitative characterization. We present a methodology for detecting, extracting, and characterizing regions of congestion in such networks, and we have implemented it in a deployable tool, Monet, which provides this analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network, that of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of congestion in production systems at a scale not previously evaluated.
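
To make the region-extraction idea concrete, the following is a minimal sketch, not Monet's actual implementation: routers of a 3D torus whose (hypothetical) credit-stall percentage exceeds a threshold are grouped with congested neighbors by breadth-first region growing. The metric name, threshold value, and function signature are illustrative assumptions, and the sketch groups routers rather than individual links to stay short.

```python
from collections import deque

def extract_congestion_regions(stall_pct, dims, threshold=25.0):
    """Group adjacent congested routers of a 3D torus into regions.

    stall_pct : dict mapping (x, y, z) router coordinates to a credit-stall
                percentage (hypothetical metric; a real tool would aggregate
                per-link performance counters).
    dims      : (X, Y, Z) extent of the torus.
    threshold : stall percentage above which a router counts as congested.
    Returns a list of regions, each a set of router coordinates.
    """
    X, Y, Z = dims
    congested = {c for c, s in stall_pct.items() if s >= threshold}
    seen, regions = set(), []
    for start in congested:
        if start in seen:
            continue
        region, queue = set(), deque([start])
        seen.add(start)
        while queue:
            x, y, z = queue.popleft()
            region.add((x, y, z))
            # Visit the 6 torus neighbours (wrap-around links).
            for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)]:
                nbr = ((x + dx) % X, (y + dy) % Y, (z + dz) % Z)
                if nbr in congested and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        regions.append(region)
    return regions

# Toy usage: a 4x4x4 torus with one small hot spot.
stalls = {(x, y, z): 5.0 for x in range(4) for y in range(4) for z in range(4)}
for c in [(0, 0, 0), (1, 0, 0), (0, 1, 0)]:
    stalls[c] = 40.0
print(extract_congestion_regions(stalls, (4, 4, 4)))
```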
