Data provenance management for bioinformatics workflows using NoSQL database systems in a cloud computing environment

Computational solutions to molecular biology problems are often expressed as workflows: sets of activities carried out by different processing entities through managed tasks. Knowing the trajectory of the data through a given workflow, that is, its provenance, enables reproducibility. Reproducing an in silico bioinformatics experiment, however, requires more than the steps of the workflow. The computational settings in which the involved programs run are also a requirement for reproducibility. Cloud computing can hide the technical details and make it easier for the user to set up such an environment on demand. NoSQL database systems have also gained popularity, particularly in the cloud. In this scenario, we planned and executed a study of a bioinformatics workflow running in an IaaS cloud computing environment. We persisted provenance data according to the PROV-DM model, using different types of NoSQL database systems. This paper presents preliminary results from our research, in which we explored the characteristics of several NoSQL database systems for persisting provenance data.
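As an illustration only, the following minimal sketch shows how the provenance of a single workflow step might be expressed with core PROV-DM concepts (entity, activity, agent, and the relations used, wasGeneratedBy and wasAssociatedWith) as one JSON-like document of the kind a document-oriented NoSQL store could hold. The function name, the `ex:` attributes, the identifiers and the file names are hypothetical and are not taken from the paper; no database client is used, and the record is only serialized to JSON.

```python
import json


def build_provenance_document(activity_id, program, used_files, generated_files, agent_id):
    """Build a PROV-DM style record for one workflow step as a plain dict.

    Hypothetical layout for illustration: the paper does not specify
    the document structure it persisted.
    """
    doc = {
        "entity": {},
        "activity": {activity_id: {"ex:program": program}},
        "agent": {agent_id: {"prov:type": "prov:SoftwareAgent"}},
        "used": [],
        "wasGeneratedBy": [],
        "wasAssociatedWith": [
            {"prov:activity": activity_id, "prov:agent": agent_id}
        ],
    }
    # Input files are entities that the activity used.
    for name in used_files:
        doc["entity"][name] = {"ex:role": "input"}
        doc["used"].append({"prov:activity": activity_id, "prov:entity": name})
    # Output files are entities generated by the activity.
    for name in generated_files:
        doc["entity"][name] = {"ex:role": "output"}
        doc["wasGeneratedBy"].append(
            {"prov:entity": name, "prov:activity": activity_id}
        )
    return doc


# Hypothetical step: a quality filter reading raw reads and writing filtered reads.
record = build_provenance_document(
    activity_id="ex:filter-run-1",
    program="quality-filter",
    used_files=["ex:reads.fastq"],
    generated_files=["ex:reads.filtered.fastq"],
    agent_id="ex:workflow-engine",
)
serialized = json.dumps(record, sort_keys=True, indent=2)
```

Keeping each step as a self-contained document is one possible design; a graph or column-family store would likely model the same relations differently.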
