Prescriptive provenance for streaming analysis of workflows at scale

We extend our approach to capturing and relating the provenance and performance metrics of computational workflows, which serves as a diagnostic tool for runtime optimization and placement. A key challenge is the volume of extracted data, for both performance metrics and provenance, which remains large even when filters are specified and capture is restricted to the quantities of interest in a simulation. We reduce this volume by performing anomaly detection on the streaming data and storing provenance for the detected anomalies, an approach we call prescriptive provenance. This paper describes the Chimbuko architecture that enables this approach. As a use case, we present a protein structure propagation workflow based on NWChemEx. We are evaluating anomaly detection algorithms and present here preliminary results obtained with the Local Outlier Factor (LOF) algorithm. Although scaling remains a challenge, these results indicate that our Chimbuko architecture for streaming analysis with prescriptive provenance is a promising approach.
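
The idea of scoring streamed events with Local Outlier Factor and keeping provenance for the flagged ones can be illustrated with a minimal sketch. This is not the Chimbuko implementation: the event fields (`id`, `func`, `runtime`), the function name `md_step`, the window size, the neighbour count `k`, and the score threshold are all assumptions chosen for illustration, and the LOF routine is a simplified pure-Python version that scores only the runtime of each event.

```python
import math

def local_outlier_factor(points, k=3):
    """Return an LOF score per point. Simplification: exactly k neighbours
    are used, so ties at the k-distance are truncated."""
    n = len(points)
    # k nearest neighbours of each point as sorted (distance, index) pairs.
    neighbours = [
        sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        for i in range(n)
    ]
    k_distance = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for nb in neighbours:
        mean_reach = sum(max(k_distance[j], d) for d, j in nb) / len(nb)
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicate points
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
        for i, nb in enumerate(neighbours)
    ]

def flag_anomalies(batch, k, threshold):
    """Score one window of events and return those above the threshold."""
    scores = local_outlier_factor([(e["runtime"],) for e in batch], k)
    return [{**e, "lof": s} for e, s in zip(batch, scores) if s > threshold]

def prescriptive_provenance(events, window=20, k=3, threshold=2.0):
    """Consume an event stream window by window and yield only the
    provenance records of events flagged as anomalous (assumed policy)."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == window:
            yield from flag_anomalies(buffer, k, threshold)
            buffer = []
    if len(buffer) > k:  # score a trailing partial window if it is large enough
        yield from flag_anomalies(buffer, k, threshold)

# Hypothetical stream: 40 function-call events, two with unusually long runtimes.
events = [
    {"id": i, "func": "md_step",
     "runtime": 5.0 if i in (7, 33) else 1.0 + 0.01 * (i % 20)}
    for i in range(40)
]
kept = list(prescriptive_provenance(events))
print([e["id"] for e in kept], f"kept {len(kept)} of {len(events)} records")
```

In this sketch the data reduction comes from discarding the records of events whose score stays near 1, which is the score LOF assigns to points whose local density matches that of their neighbours. A fixed, non-overlapping window is the simplest choice; a sliding window or a model updated incrementally would be alternatives at the cost of more state per stream.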
