Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool

Due to the sheer volume of data, it is typically impractical to analyze the detailed performance of an HPC application running at scale. While conventional small-scale benchmarking and scaling studies are often sufficient for simple applications, many modern workflow-based applications couple multiple elements with competing resource demands and complex inter-communication patterns, whose performance cannot easily be studied in isolation or at small scale. This work discusses Chimbuko, a performance analysis framework that provides real-time, in situ anomaly detection. By focusing specifically on performance anomalies and their origin (i.e., their provenance), data volumes are dramatically reduced without losing the necessary detail. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework. We demonstrate the tool’s usefulness on Oak Ridge National Laboratory’s Summit system.
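To make the data-reduction idea concrete, the following is a minimal sketch of per-function streaming anomaly detection using one-pass (Welford) running statistics, so that no raw trace samples need to be retained. The class names, the sigma-rule threshold, and the observe() interface are illustrative assumptions for this sketch, not Chimbuko's actual API or algorithm.

```python
# Sketch: flag a function execution as anomalous when its runtime
# deviates from that function's running mean by more than k sigmas.
# All names here are hypothetical; Chimbuko's detector is more involved.
from collections import defaultdict
from math import sqrt

class RunningStats:
    """One-pass (Welford) mean/variance; stores no raw samples."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

class SigmaRuleDetector:
    """Per-function sigma-rule detector over a stream of trace events."""
    def __init__(self, threshold=6.0, warmup=30):
        self.threshold = threshold  # number of sigmas before flagging
        self.warmup = warmup        # executions observed before flagging
        self.stats = defaultdict(RunningStats)

    def observe(self, func_name, runtime_us):
        s = self.stats[func_name]
        is_anomaly = (s.n >= self.warmup and s.stddev() > 0.0 and
                      abs(runtime_us - s.mean) > self.threshold * s.stddev())
        s.update(runtime_us)
        return is_anomaly

# Only flagged executions (and their provenance) would be kept,
# shrinking the stored trace volume dramatically.
detector = SigmaRuleDetector(threshold=3.0, warmup=5)
for runtime in [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 48.7]:
    if detector.observe("MPI_Allreduce", runtime):
        print(f"anomalous execution: {runtime} us")
```

Because each per-function summary is a constant-size triple (count, mean, squared deviations), such detectors scale to long runs and can be merged across ranks, which is what makes online, workflow-level deployment plausible.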
