A Review of Supercomputer Performance Monitoring Systems
Ashish Ranjan | Vladimir V. Voevodin | Sucheta Pawar | Sanjay Wandhekar | Konstantin S. Stefanov
[1] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.
[2] Allen D. Malony, et al. The Tau Parallel Performance System, 2006, Int. J. High Perform. Comput. Appl.
[3] Ronald Minnich. Supermon: High-Performance Monitoring for Linux Clusters, 2001, Annual Linux Showcase & Conference.
[4] Hadi Sharifi, et al. Push Me Pull You: Integrating Opposing Data Transport Modes for Efficient HPC Application Monitoring, 2015, 2015 IEEE International Conference on Cluster Computing.
[5] Mark R. Fahey, et al. User Environment Tracking and Problem Detection with XALT, 2014, 2014 First International Workshop on HPC User Support Tools.
[6] Konstantin Stefanov, et al. Dynamically Reconfigurable Distributed Modular Monitoring System for Supercomputers (DiMMon), 2015.
[7] Ronald Minnich, et al. Supermon: a high-speed cluster monitoring system, 2002, Proceedings. IEEE International Conference on Cluster Computing.
[8] Bert J. Debusschere, et al. Ovis-2: A robust distributed architecture for scalable RAS, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[9] James C. Browne, et al. Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats, 2014, 2014 First International Workshop on HPC User Support Tools.
[10] Jie Li, et al. MonSTer: An Out-of-the-Box Monitoring Tool for High Performance Computing Systems, 2020, 2020 IEEE International Conference on Cluster Computing (CLUSTER).
[11] R. Scott Studham, et al. NWPerf: a system wide performance monitoring tool for large Linux clusters, 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[12] B.P. Miller, et al. MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[13] Gerhard Wellein, et al. LIKWID Monitoring Stack: A Flexible Framework Enabling Job Specific Performance monitoring for the masses, 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[14] Frank Mueller, et al. Desh: deep learning for system health prediction of lead times to failure in HPC, 2018, HPDC.
[15] Michael Kluge, et al. Mapping of RAID Controller Performance Data to the Job History on Large Computing Systems, 2014, 2014 International Workshop on Data Intensive Scalable Computing Systems.
[16] James C. Browne, et al. Open XDMoD: A Tool for the Comprehensive Management of High-Performance Computing Resources, 2015, Computing in Science & Engineering.
[17] John M. May, et al. MPX: Software for multiplexing hardware performance counters in multithreaded programs, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[18] Nathan R. Tallent, et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs, 2010, Concurr. Comput. Pract. Exp.
[19] Ahmad Yasin, et al. A Top-Down method for performance analysis and counters architecture, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[20] Wolfgang E. Nagel, et al. Collecting Distributed Performance Data with Dataheap: Generating and Exploiting a Holistic System View, 2012, ICCS.
[21] Rajkumar Buyya, et al. PARMON: a portable and scalable monitoring system for clusters, 2000.
[22] Wolfgang Frings, et al. Scalable Control and Monitoring of Supercomputer Applications Using an Integrated Tool Framework, 2011, 2011 40th International Conference on Parallel Processing Workshops.
[23] Thomas W. Tucker, et al. The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[24] James C. Browne, et al. Enabling comprehensive data-driven system management for large computational facilities, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[25] Jeanine Cook, et al. Improved estimation for software multiplexing of performance counters, 2005, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.