MAP: A Visual Analytics System for Job Monitoring and Analysis

High-performance computing (HPC) systems run compute-intensive jobs for many users. Users submit jobs to batch queues, where the jobs wait for an unknown amount of time until the required resources become available. Resource managers collect a large amount of data about these jobs (submit time, start time, end time, resource requirements, etc.). Analyzing this data may help identify the causes of past problems and better optimize the system, but working through large, complex logs can be cumbersome. We have developed a unified job monitoring, analysis, and prediction system with which users can monitor the current state, analyze past job logs, and predict the wait times of future jobs. In this paper, we focus on the job monitoring and analysis modules.
