Efficient Runtime Capture of Multiworkflow Data Using Provenance

Computational Science and Engineering (CSE) projects are typically developed by multidisciplinary teams. Although the teams belong to the same project, each one manages its own workflows, using its own execution environments and data processing tools. Analyzing the data processed by all workflows globally is a core task in a CSE project. However, this analysis is hard because the data generated by these workflows are not integrated. In addition, because these workflows may take a long time to execute, data analysis needs to be done at runtime to reduce the cost and duration of the CSE project. A typical solution in scientific data analysis is to capture and relate the data in a provenance database while the workflows run, which allows for data analysis at runtime. The main problem is that such data capture competes with the running workflows for resources, adding significant overhead to their execution. To mitigate this problem, we introduce a system called ProvLake, which adopts design principles for efficient distributed data capture from the workflows. While capturing the data, ProvLake logically integrates them and ingests them into a provenance database that is ready for analysis at runtime. We validated ProvLake in a real use case in the oil and gas (O&G) industry encompassing four workflows that process 5 TB datasets for a deep learning classifier. Compared with Komadu, the closest solution that meets our goals, our approach enables runtime multiworkflow data analysis with much lower overhead (e.g., 0.1%).
