Capturing and Analyzing Provenance from Spark-based Scientific Workflows with SAMbA-RaP

Abstract Several scientists have moved their IO- and CPU-intensive workflows to Data-Intensive Scalable Computing (DISC) frameworks, aiming to benefit from high scalability, broad support, and manufacturers’ infrastructure. A prominent framework is Apache Spark, which has grown rapidly over the last ten years and become one of the most widely used technologies in big data. Apache Spark brings several advantages, such as very efficient in-memory data management for large-scale applications through Resilient Distributed Datasets (RDDs). Such an in-memory replacement for MapReduce enables the data handling activities of scientific workflows to be executed orders of magnitude faster than in other DISC environments. A major drawback, however, is that Apache Spark still lacks support for both data tracking and workflow provenance. Accordingly, the sole alternative for users who rely on provenance features is to spend countless hours collecting data from log files. As an additional challenge, Apache Spark treats legacy programs within workflows as “black-box” activities, which prevents the capture and analysis of data movements through RDDs. This manuscript presents SAMbA-RaP (Spark provenAnce MAnagement with Reports and Presentation), a solution for capturing, storing, and querying prospective and retrospective provenance, as well as domain data, within distributed scientific workflows. SAMbA-RaP performance was evaluated on real workflow cases (SciPhy, Montage, WordCount, BuzzFlow, and SalesForecasts) from distinct domains, e.g., literature, bioinformatics, and astronomy, and the results indicate that the average overhead imposed by managing provenance data is acceptable. Moreover, the experiments indicate that our solution handles workflows with and without legacy applications alike, which enables users to query and verify provenance data in SAMbA-RaP reports straightforwardly and transparently.
