A Blockchain-Based Approach to Provenance and Reproducibility in Research Workflows

The traditional Proof of Existence blockchain service on the Bitcoin network can be used to verify that a piece of research data existed at a specific point in time, and to validate its integrity, without revealing its content. Several variants of this service certify the existence of data by means of cryptographic fingerprinting, which enables efficient verification of the authenticity of such certifications. However, in most scientific research workflows, data is continuously modified through successive processing steps, so certifications of individual data objects appear constantly outdated in this setting. This paper describes how blockchain and distributed ledger technology can be used to form a new certification model that captures the research process as a whole in a more meaningful way, including descriptions of the data at its different stages, the associated computational pipeline, the analysis code, and the experimental design. The scientific blockchain infrastructure bloxberg, together with a deep-learning-based analysis from the field of behavioral science, is used to demonstrate the applicability of the approach.
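
The fingerprinting idea can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not call bloxberg or any blockchain API; the artifact names, the manifest layout, and the helper functions `fingerprint`, `build_manifest`, and `verify` are all assumptions made for illustration. The sketch hashes each artifact of a workflow (data stages, analysis code, experimental design) with SHA-256 and then hashes the resulting manifest, so that a single digest stands for the whole research process and could be anchored in a ledger without revealing any content.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict:
    """Fingerprint every artifact, then fingerprint the manifest itself.

    The manifest layout is hypothetical; only the digests, never the
    contents, would need to be published.
    """
    entries = {name: fingerprint(content) for name, content in artifacts.items()}
    # Canonical serialization so the same artifacts always give the same digest.
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return {
        "artifacts": entries,
        "workflow_fingerprint": fingerprint(canonical.encode("utf-8")),
    }


def verify(artifacts: dict[str, bytes], certified_fingerprint: str) -> bool:
    """Recompute the workflow fingerprint and compare it with the certified one."""
    return build_manifest(artifacts)["workflow_fingerprint"] == certified_fingerprint


# Hypothetical workflow artifacts, standing in for real files.
workflow = {
    "raw_data.csv": b"subject,frame,x,y\n1,0,10,12\n",
    "processed_data.csv": b"subject,mean_x,mean_y\n1,10,12\n",
    "analysis.py": b"print('analysis placeholder')\n",
    "experimental_design.md": b"Design notes placeholder\n",
}

manifest = build_manifest(workflow)
certified = manifest["workflow_fingerprint"]  # the value one would anchor on-chain

# Any later change to a single artifact changes the workflow fingerprint.
tampered = dict(workflow, **{"processed_data.csv": b"subject,mean_x,mean_y\n1,11,12\n"})
print(verify(workflow, certified), verify(tampered, certified))
```

Certifying the manifest digest rather than each object separately is one way to keep a certification tied to the complete workflow state; a new digest would be produced whenever any stage changes.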
