Automatic Generation of Provenance Metadata during Execution of Scientific Workflows

Data processing in data-intensive scientific fields such as bioinformatics is automated to a great extent. One means of automation is the workflow engine, which executes an explicitly stated sequence of computations. Scientists can use such workflows through science gateways or develop them on their own. In both cases they may have to preprocess their raw data and may also want to process the workflow output further. The scientist has to take care of the provenance of the whole data processing pipeline. This is not a trivial task because of the diverse set of computational tools and environments used during the transformation of raw data into final results. We therefore created a metadata schema to provide provenance for data processing pipelines and implemented a tool that creates this metadata during the execution of typical scientific computations.

Keywords: Provenance, Reproducibility, Workflows, Science Gateways
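To make the idea of creating provenance metadata during execution concrete, the following is a minimal sketch and not the authors' tool or schema. It wraps a single processing step and records the command, timestamps, exit code, and content hashes of the input and output files. The function names (run_with_provenance, sha256_of, demo) and the record fields are assumptions chosen for illustration only.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_provenance(command, inputs, outputs):
    """Run a command and return a provenance record.

    The record layout is hypothetical; it is not the schema from the paper.
    """
    # Hash inputs before execution so later modification cannot go unnoticed.
    input_hashes = {str(p): sha256_of(p) for p in inputs}
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    ended = datetime.now(timezone.utc).isoformat()
    return {
        "command": [str(part) for part in command],
        "started_at": started,
        "ended_at": ended,
        "exit_code": result.returncode,
        "inputs": input_hashes,
        # Only hash outputs that the step actually produced.
        "outputs": {str(p): sha256_of(p) for p in outputs if Path(p).exists()},
    }


def demo():
    """Run a stand-in processing step (uppercasing a file) and return its record."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "raw.txt"
        dst = Path(tmp) / "result.txt"
        src.write_text("acgt\n")
        script = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        cmd = [sys.executable, "-c", script, str(src), str(dst)]
        return run_with_provenance(cmd, [src], [dst])


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

A real pipeline would need considerably more than this, for example tool versions and a description of the execution environment, since the abstract names the diversity of tools and environments as the core difficulty.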
