BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets

Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). This creates a computational bottleneck, because existing software systems are neither scalable nor secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop with support for multi-tenant studies, reduced its storage requirements using erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine that features a workflow definition language focused on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easy to install and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open source and includes predefined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.
