Scalable Performance Awareness for In Situ Scientific Applications

Part of the promise of exascale computing and the next generation of scientific simulation codes is the ability to bring together time and spatial scales that have traditionally been treated separately. This enables complex coupled simulations and in situ analysis pipelines, encompassing such things as "whole device" fusion models or the simulation of cities from sewers to rooftops. Unfortunately, the HPC analysis tools built up over the preceding decades are ill-suited to debugging and performance analysis of such computational ensembles. In this paper, we present a new vision for performance measurement and understanding of HPC codes, MonitoringAnalytics (MONA). MONA is designed to be a flexible, high-performance monitoring infrastructure that can perform monitoring analysis in place or in transit by embedding analytics and characterization directly into the data stream, without relying on delivering all monitoring information to a central database for post-processing. It addresses the trade-off between the prohibitively expensive capture of all performance characteristics and not capturing enough to detect the features of interest. We demonstrate several uses of MONA: capturing and indexing multi-executable performance profiles to enable later processing; extracting performance primitives to enable the generation of customizable benchmarks and performance skeletons; and extracting communication and application behaviors to enable better control and placement for current and future runs of the science ensemble. Relevant performance results are provided for DOE science applications on leadership machines, using a MONA system built from ADIOS and SOSflow technologies.
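To make the in-place/in-transit idea concrete, the following is a minimal sketch, not the MONA, ADIOS, or SOSflow API: application ranks publish performance samples into a monitoring stream, and an aggregator folds them into summary statistics on the fly, so only the reduced characterization (rather than every raw sample) would need to leave the node or travel to a central store. All names, counters, and the threading/queue transport below are illustrative assumptions.

```python
"""Hypothetical sketch of in-transit monitoring analytics (not the MONA API)."""
import queue
import statistics
import threading
import time

monitor_stream = queue.Queue()  # stands in for an in-transit transport layer


def publish_sample(rank, counter, value):
    """Application-side hook: emit one performance sample into the stream."""
    monitor_stream.put({"rank": rank, "counter": counter,
                        "value": value, "t": time.time()})


def aggregate(window=0.5):
    """In-transit analytics: reduce raw samples to per-counter summaries."""
    samples = {}
    deadline = time.time() + window
    while time.time() < deadline:
        try:
            s = monitor_stream.get(timeout=0.05)
        except queue.Empty:
            continue
        samples.setdefault(s["counter"], []).append(s["value"])
    # Only this reduced characterization is forwarded, not the raw stream.
    return {c: {"n": len(v), "mean": statistics.mean(v), "max": max(v)}
            for c, v in samples.items()}


if __name__ == "__main__":
    # Simulate a few ranks reporting a (hypothetical) MPI wait-time counter.
    for rank in range(4):
        threading.Thread(target=lambda r=rank: [
            publish_sample(r, "mpi_wait_ms", 1.0 + 0.1 * r) for _ in range(100)
        ]).start()
    time.sleep(0.1)
    print(aggregate())
```

In a real deployment the queue would be replaced by the monitoring transport (e.g., an ADIOS stream or SOSflow publication channel), and the aggregation step could run on dedicated staging resources rather than inside the application process.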
