Workflow provenance in the lifecycle of scientific machine learning

Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting computational science and engineering domains such as geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses that combine scientific data and ML models to meet critical requirements such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, which makes such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. Our contributions are (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned from an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping overhead low (<1%), scaling well, and, under certain workloads, accelerating queries by an order of magnitude compared with running them without our representation.
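
As a rough illustration of what a provenance query that integrates domain semantics with ML models could look like, the sketch below stores PROV-style records in memory and traces a trained model back to the domain data it depends on. It is a minimal sketch under stated assumptions, not the representation or system described in this work: the `ProvStore` class, the entity and activity identifiers, and the attributes are all hypothetical. Only the relation names follow the W3C PROV concepts of an activity that used an entity and an entity that was generated by an activity.

```python
from dataclasses import dataclass, field


@dataclass
class ProvStore:
    """Tiny in-memory store of PROV-style records (hypothetical, illustrative only)."""
    entities: dict = field(default_factory=dict)    # entity id -> attributes
    activities: dict = field(default_factory=dict)  # activity id -> attributes
    used: list = field(default_factory=list)        # (activity id, entity id) pairs
    generated: list = field(default_factory=list)   # (entity id, activity id) pairs

    def entity(self, eid, **attrs):
        self.entities[eid] = attrs

    def activity(self, aid, **attrs):
        self.activities[aid] = attrs

    def lineage(self, eid):
        """Return the ids of all entities that `eid` transitively depends on."""
        seen, stack = set(), [eid]
        while stack:
            current = stack.pop()
            # Find the activities that generated the current entity ...
            for ent, act in self.generated:
                if ent != current:
                    continue
                # ... and follow every entity those activities used.
                for user, src in self.used:
                    if user == act and src not in seen:
                        seen.add(src)
                        stack.append(src)
        return seen


# Hypothetical records: domain data, a derived training set, and a model.
store = ProvStore()
store.entity("seismic_raw", domain="geoscience", kind="seismic cube")
store.entity("training_set", kind="labeled tiles")
store.entity("model_v1", kind="ml model", learning_rate=0.001)
store.activity("data_preparation")
store.activity("model_training")

store.used.append(("data_preparation", "seismic_raw"))
store.generated.append(("training_set", "data_preparation"))
store.used.append(("model_training", "training_set"))
store.generated.append(("model_v1", "model_training"))

# Query: which domain-annotated inputs does the model ultimately depend on?
ancestors = store.lineage("model_v1")
domain_inputs = sorted(e for e in ancestors if "domain" in store.entities[e])
print(domain_inputs)
```

Keeping domain attributes and ML attributes on entities in the same store is what lets a single traversal answer a question that spans both; a real deployment would rely on a database and a full PROV serialization rather than Python lists.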
