Measuring Congestion in High-Performance Datacenter Interconnects
Larry Kaplan | Mike Showerman | Ravi Iyer | Zbigniew Kalbarczyk | Saurabh Jha | William Kramer | Archit Patke | Jim Brandt | Greg Bauer | Benjamin Lim | Ann C. Gentile
[1] Tomohiro Inoue,et al. Tofu: Interconnect for the K computer, 2012.
[2] Kevin T. Pedretti,et al. Demonstrating improved application performance using dynamic monitoring and task mapping , 2014, 2014 IEEE International Conference on Cluster Computing (CLUSTER).
[3] Laxmikant V. Kalé,et al. Maximizing Throughput on a Dragonfly Network , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[4] William J. Dally,et al. Technology-Driven, Highly-Scalable Dragonfly Topology , 2008, 2008 International Symposium on Computer Architecture.
[5] Katherine E. Isaacs,et al. There goes the neighborhood: Performance degradation due to nearby jobs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Robert B. Ross,et al. Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[7] Trevor Blackwell,et al. Credit-based flow control for ATM networks: credit update protocol, adaptive credit allocation and statistical multiplexing , 1994, SIGCOMM 1994.
[8] Philip Heidelberger,et al. The IBM Blue Gene/Q interconnection network and message unit , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[9] Franck Cappello,et al. Scheduling the I/O of HPC Applications Under Congestion , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[10] Toshiyuki Shimizu,et al. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers , 2009, Computer.
[11] Myungjin Lee,et al. Distributed Network Monitoring and Debugging with SwitchPointer , 2018, NSDI.
[12] David A. Maltz,et al. Network traffic characteristics of data centers in the wild , 2010, IMC '10.
[13] Yi Zheng,et al. The TH Express high performance interconnect networks , 2014, Frontiers of Computer Science.
[14] Kevin T. Pedretti,et al. Overtime: a tool for analyzing performance variation due to network interference , 2015, ExaMPI '15.
[15] Larry Kaplan,et al. The Gemini System Interconnect , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.
[16] Nick McKeown,et al. I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks , 2014, NSDI.
[17] T. Rabbani,et al. Segmentation of Point Clouds Using Smoothness Constraint, 2006.
[18] Weiguo Liu,et al. End-to-end I/O Monitoring on Leading Supercomputers , 2022, NSDI.
[19] Ravishankar K. Iyer,et al. Live Forensics for Distributed Storage Systems , 2019, ArXiv.
[20] J. Enos,et al. Topology-Aware Job Scheduling Strategies for Torus Networks, 2014.
[21] Ming Zhang,et al. Congestion Control for Large-Scale RDMA Deployments , 2015, Comput. Commun. Rev..
[22] Vladimir Braverman,et al. One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon , 2016, SIGCOMM.
[23] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.
[24] Charles E. Leiserson,et al. Fat-trees: Universal networks for hardware-efficient supercomputing , 1985, IEEE Transactions on Computers.
[25] Laxmikant V. Kalé,et al. Quantifying Network Contention on Large Parallel Machines , 2009, Parallel Process. Lett..
[26] Fumiyoshi Shoji,et al. Overview of the K computer System, 2012.
[27] Dhabaleswar K. Panda,et al. Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[28] Kenichi Miura,et al. Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect , 2014, ISC.
[29] Onur Mutlu,et al. A Large Scale Study of Data Center Network Reliability , 2018, Internet Measurement Conference.
[30] Laxmikant V. Kalé,et al. Identifying the Culprits Behind Network Congestion , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[31] Stephen L. Olivier,et al. Exploiting Geometric Partitioning in Task Mapping for Parallel Computers , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[32] Wei Ge,et al. The Sunway TaihuLight supercomputer: system and applications , 2016, Science China Information Sciences.
[33] Ravishankar K. Iyer,et al. Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[34] Dongsu Han,et al. Credit-Scheduled Delay-Bounded Congestion Control for Datacenters , 2017, SIGCOMM.
[35] Courtenay T. Vaughan,et al. Using the Cray Gemini Performance Counters, 2013.
[36] Anirudh Sivaraman,et al. Demonstration of the Marple System for Network Performance Monitoring , 2017, SIGCOMM Posters and Demos.
[37] Amin Vahdat,et al. TIMELY: RTT-based Congestion Control for the Datacenter , 2015, Comput. Commun. Rev..
[38] Marianne Winslett,et al. A Multiplatform Study of I/O Behavior on Petascale Supercomputers , 2015, HPDC.
[39] Celso L. Mendes,et al. Deploying a Large Petascale System: The Blue Waters Experience , 2014, ICCS.
[40] Gengbin Zheng,et al. A uGNI-based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[41] Thomas W. Tucker,et al. The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[42] Mike Higgins,et al. Cray Cascade: A scalable HPC system based on a Dragonfly network , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[43] Myungjin Lee,et al. Simplifying Datacenter Network Debugging with PathDump , 2016, OSDI.
[44] D. Skinner,et al. Understanding the causes of performance variability in HPC workloads, 2005, Proceedings of the IEEE Workload Characterization Symposium, 2005.
[45] Mauricio O. Carneiro,et al. From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.
[46] Nicholas J. Wright,et al. Understanding Performance Variability on the Aries Dragonfly Network , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[47] Ravishankar K. Iyer,et al. Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters , 2018, IEEE Transactions on Dependable and Secure Computing.
[48] Ann C. Gentile,et al. Infrastructure for In Situ System Monitoring and Application Data Analysis , 2015, ISAV@SC.
[49] Valerio Pascucci,et al. Analyzing Network Health and Congestion in Dragonfly-Based Supercomputers , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[50] Daniel P. Siewiorek,et al. Error log analysis: statistical modeling and heuristic trend analysis, 1990.
[51] Robert B. Ross,et al. Watch Out for the Bully! Job Interference Study on Dragonfly Network , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[52] Ravishankar K. Iyer,et al. LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications , 2015, FTXS@HPDC.
[53] William Gropp,et al. The Blue Waters super-system for super-science, 2013.
[54] R. Sisneros,et al. A Diagnostic Utility For Analyzing Periods Of Degraded Job Performance, 2014.
[55] José Duato,et al. Efficient, Scalable Congestion Management for Interconnection Networks , 2006, IEEE Micro.
[56] Ravishankar K. Iyer,et al. A Study of Network Congestion in Two Supercomputing High-Speed Interconnects , 2019, 2019 IEEE Symposium on High-Performance Interconnects (HOTI).
[57] Laxmikant V. Kalé,et al. Scalable molecular dynamics with NAMD , 2005, J. Comput. Chem..
[58] Yijia Zhang,et al. Diagnosing Performance Variations in HPC Applications Using Machine Learning , 2017, ISC.
[59] A. Gentile,et al. Network Performance Counter Monitoring and Analysis on the Cray XC Platform, 2016.
[60] David Walker,et al. Compiling Path Queries , 2016, NSDI.
[61] Laxmikant V. Kalé,et al. Automatic topology mapping of diverse large-scale parallel applications , 2017, ICS '17.