Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of existing provenance tracking solutions is that they cannot capture and integrate provenance of the domain and ML data processed in the multiple workflows of the lifecycle while keeping the capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows, which address the characteristics of ML and CSE and allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas (O&G) industry, along with its evaluation using 239,616 CUDA cores in parallel.
