High Performance Cluster Monitoring System

System monitoring is an important basis for system modelling and improvement. To achieve higher performance efficiency and lower energy consumption in cluster systems, we design a monitoring system that tracks the performance and temperature of clusters for education and research purposes. The total energy consumption of a cluster can be estimated from the performance and temperature data. Specifically, in our proposed system, all real-time data (including the temperature and activity of the components of each computing node) are collected and stored in Round-Robin Databases (RRDs). These data can be visualized or downloaded through a user-friendly interface for further analysis. Moreover, our system provides a runtime comparison feature that lets users compare the performance of a running experiment with historical experimental results without waiting for the experiment to complete. We demonstrate the data visualization and user interfaces of the monitoring system with an experiment on our cluster system.
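To make the storage and comparison ideas concrete, the following is a minimal Python sketch, not the system's actual implementation. It assumes a round-robin store can be approximated by a fixed-capacity buffer in which each new sample overwrites the oldest, and that a runtime comparison aligns a running experiment with a historical one by elapsed sample index. The names `RoundRobinSeries` and `compare_with_history` and all sample values are hypothetical. Real RRD tools such as RRDtool also consolidate samples into archives, which this sketch omits.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-capacity series: once full, each new sample overwrites the oldest."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item when it is full.
        self.samples = deque(maxlen=capacity)

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def values(self):
        return [value for _, value in self.samples]


def compare_with_history(running, historical):
    """Return running - historical for each step the running experiment has reached.

    zip() stops at the shorter sequence, so a partially complete run is
    compared only against the matching prefix of the historical run.
    """
    return [r - h for r, h in zip(running, historical)]


# Hypothetical CPU temperature samples in degrees Celsius.
cpu_temp = RoundRobinSeries(capacity=3)
for t, celsius in enumerate([41.0, 42.5, 44.0, 47.5]):
    cpu_temp.record(t, celsius)

print(cpu_temp.values())  # [42.5, 44.0, 47.5]: the oldest sample was overwritten

# A run that has produced two samples so far, against a finished three-sample run.
print(compare_with_history([50.0, 55.0], [48.0, 56.0, 60.0]))  # [2.0, -1.0]
```

The fixed capacity keeps storage bounded regardless of how long the cluster is monitored, and the prefix-only comparison is what allows a running experiment to be checked against history before it finishes.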
