OVIS-2: A robust distributed architecture for scalable RAS

Resource utilization in high-performance computing (HPC) clusters can be improved by increased awareness of system state. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant reliability, availability, and serviceability (RAS) framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We also describe some of its statistical analysis and 3-D visualization capabilities, along with use cases for each. Together, the framework and its associated tools let engineers explore the behaviors and complex interactions of low-level system elements while giving system administrators their desired level of detail about ongoing system and component health.
