Identifying impact of software dependencies on replicability of biomedical workflows

Complex data-driven experiments form the basis of biomedical research. Recent findings warn that the context in which software is run, that is, the infrastructure and the third-party dependencies, can have a crucial impact on the final results delivered by a computational experiment. This implies that, to replicate a result, not only must the same data be used, but the experiment must also be run on an equivalent software stack. In this paper we present the VFramework, which enables assessing the replicability of workflows. It identifies whether any differences in software dependencies exist between two executions of the same workflow and whether they have an impact on the produced results. We also conduct a case study in which we investigate the impact of software dependencies on the replicability of Taverna workflows used in biomedical research on Huntington's disease. We re-execute the analysed workflows in environments differing in operating system distribution and configuration. The results show that the VFramework can be used to identify the impact of software dependencies on the replicability of biomedical workflows. Furthermore, we observe that although the workflows are executed in a controlled environment, they still depend on specific tools installed in that environment. The context model used by the VFramework addresses the deficiencies of provenance traces by also documenting such tools. Based on our findings, we define guidelines that enable workflow owners to improve the replicability of their workflows.
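The comparison described in the abstract, checking two executions of the same workflow for dependency differences and for differences in results, can be illustrated with a minimal sketch. This is not the VFramework's implementation or its context model: the `Execution` record, the `compare_executions` function, and the byte-for-byte comparison of outputs via SHA-256 are all assumptions made for illustration. Real workflow outputs may need a tolerance-aware or format-aware comparison instead.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Execution:
    """Hypothetical record of one workflow run (illustrative only)."""
    dependencies: dict  # package name -> installed version
    outputs: dict       # output name -> produced bytes


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used to compare outputs byte for byte."""
    return hashlib.sha256(data).hexdigest()


def compare_executions(original: Execution, rerun: Execution) -> dict:
    """Report dependency differences and which outputs changed between runs."""
    names = set(original.dependencies) | set(rerun.dependencies)
    # Keep only dependencies whose version differs or that exist in one run only.
    dependency_diff = {
        name: (original.dependencies.get(name), rerun.dependencies.get(name))
        for name in sorted(names)
        if original.dependencies.get(name) != rerun.dependencies.get(name)
    }
    ports = set(original.outputs) | set(rerun.outputs)
    # An output counts as changed if it is missing in one run or its digest differs.
    changed_outputs = sorted(
        port
        for port in ports
        if port not in original.outputs
        or port not in rerun.outputs
        or digest(original.outputs[port]) != digest(rerun.outputs[port])
    )
    return {
        "dependency_diff": dependency_diff,
        "changed_outputs": changed_outputs,
        "replicated": not changed_outputs,
    }


# Hypothetical data: package names, versions, and outputs are made up.
first = Execution({"libfoo": "1.0", "tool-x": "2.3"}, {"result": b"0.42"})
second = Execution({"libfoo": "1.1", "tool-x": "2.3"}, {"result": b"0.43"})
report = compare_executions(first, second)
```

A dependency difference that coincides with a changed output only flags a candidate cause. Establishing that the dependency is responsible requires further investigation, for example re-running with that single dependency aligned.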
