Reproducible Bioconductor Workflows Using Browser-based Interactive Notebooks and Containers

Objective: Bioinformatics publications typically include complex software workflows that are difficult to describe in a manuscript. We describe and demonstrate the use of interactive software notebooks to document and distribute bioinformatics research. We provide a user-friendly tool, BiocImageBuilder, that allows users to easily distribute their bioinformatics protocols through interactive notebooks uploaded to either a GitHub repository or a private server.

Materials and methods: We present four different interactive Jupyter notebooks using R and Bioconductor workflows to infer differential gene expression, analyze cross-platform datasets, and process RNA-seq data and KinomeScan data. These interactive notebooks are available on GitHub. The analytical results can be viewed in a browser. Most importantly, the software contents can be executed and modified. This is accomplished using Binder, which runs the notebook inside software containers, thus avoiding the need to install any software and ensuring reproducibility. All the notebooks were produced using custom files generated by BiocImageBuilder.

Results: BiocImageBuilder facilitates the publication of workflows with a point-and-click user interface. We demonstrate that interactive notebooks can be used to disseminate a wide range of bioinformatics analyses. The use of software containers to mirror the original software environment ensures reproducibility of results. Parameters and code can be dynamically modified, allowing for robust verification of published results and encouraging rapid adoption of new methods.

Conclusion: Given the increasing complexity of bioinformatics workflows, we anticipate that these interactive software notebooks will become as necessary, and as ubiquitous, for documenting software methods as traditional laboratory notebooks have been for documenting bench protocols.
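As an illustration of how a notebook hosted on GitHub can be opened through Binder, the following is a minimal sketch that builds a launch link for the public mybinder.org service. It is not part of BiocImageBuilder. The function name `binder_launch_url` and the repository and notebook names are hypothetical, and the URL layout is an assumption based on the current public Binder service, which may differ from the layout in use when the notebooks were published.

```python
from urllib.parse import quote

# Assumption: the public Binder service. A private BinderHub deployment
# would use a different host name.
BINDER_HOST = "https://mybinder.org"


def binder_launch_url(owner, repo, ref="HEAD", notebook=None, host=BINDER_HOST):
    """Build a Binder launch link for a GitHub repository.

    owner, repo: GitHub account and repository name.
    ref: branch, tag, or commit to build the container from.
    notebook: optional path of a notebook to open after launch.
    """
    # Percent-encode each component so slashes inside a value are not
    # mistaken for URL path separators.
    parts = [quote(part, safe="") for part in (owner, repo, ref)]
    url = f"{host}/v2/gh/" + "/".join(parts)
    if notebook:
        url += "?filepath=" + quote(notebook, safe="")
    return url


if __name__ == "__main__":
    # Hypothetical repository and notebook, for illustration only.
    print(binder_launch_url("example-lab", "example-workflow",
                            notebook="notebooks/analysis.ipynb"))
```

Pinning `ref` to a specific commit or tag rather than a moving branch makes the link resolve to the same repository contents each time, which fits the reproducibility goal described in the abstract.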
