Large Scale System Monitoring and Analysis on Blue Waters using OVIS

Understanding the complex interplay between applications competing for shared platform resources can be key to maximizing both platform and application performance. At the same time, using monitoring tools on platforms designed to support extreme-scale applications presents challenges: the tools must scale, and they must limit the noise and jitter they impose on applications. In this paper, we present our approach to high-fidelity, whole-system monitoring of resource utilization, including High Speed Network link data, on Blue Waters, NCSA’s Cray XE/XK platform, using the OVIS monitoring framework. We then describe the architectural and implementation details that make this monitoring system suitable for scalable monitoring within the Cray hardware and software environment. Finally, we present our methodologies for measuring the impact of monitoring on applications, along with the results.