A cost-effective approach to improving performance of big genomic data analyses in clouds

Abstract: With the rapidly growing demand for DNA analysis, storing and processing large-scale genome data presents significant challenges. This paper describes how the Genome Analysis Toolkit (GATK) can be deployed to an elastic cloud, and defines a policy to drive elastic scaling of the application. We extensively analyse the GATK to expose opportunities for resource elasticity, demonstrate that it can be practically deployed at scale in a cloud environment, and show in a simulated environment that applying elastic scaling improves the performance-to-cost tradeoff.
