A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E. coli and 53.5% (95% CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment also differed significantly between the two services, with EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive than GCE for the E. coli and human assemblies, respectively. Thus, GCE was found to outperform EMR in terms of both cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for the analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we provide ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

INTRODUCTION

Through the application of high-throughput sequencing, there has been a dramatic increase in the availability of large-scale genomic datasets.[1] With falling sequencing costs, small and medium-sized laboratories can now easily amass many gigabytes of data.
Given this dramatic increase in the volume of data generated, researchers are being forced to seek efficient and cost-effective approaches to computational analysis.[2] Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity, without concern about its physical structure or ongoing maintenance.[3-6] However, transitioning to a cloud environment presents unique strategic decisions,[7] and although a number of general benchmarking results are available (http://serverbear.com/benchmarks/cloud; https://cloudharmony.com/), there has been a paucity of comparisons of cloud computing services specifically for genomic
[1] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput., 2004.
[2] Dawei Li, et al. The diploid genome sequence of an Asian individual. Nature, 2008.
[3] M. Schatz, et al. Searching for SNPs with cloud computing. Genome Biology, 2009.
[4] D. Parkhomchuk, et al. Use of high throughput sequencing to observe genome dynamics at a single cell level. Proceedings of the National Academy of Sciences, 2009.
[5] Jorge-Arnulfo Quiané-Ruiz, et al. Runtime measurements in the cloud. Proc. VLDB Endow., 2010.
[6] Michael C. Schatz, et al. Cloud Computing and the DNA Data Race. Nature Biotechnology, 2010.
[7] G. Nolan, et al. Computational solutions to large-scale data management and analysis. Nature Reviews Genetics, 2010.
[8] L. Stein. The case for cloud computing in genome informatics. Genome Biology, 2010.
[9] Samuel V. Angiuoli, et al. Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing. PLoS ONE, 2011.
[10] Peter J. Tonellato, et al. Biomedical Cloud Computing With Amazon Web Services. PLoS Comput. Biol., 2011.
[11] John Chilton, et al. Implementation of Cloud based Next Generation Sequencing data analysis in a clinical laboratory. BMC Research Notes, 2013.
[12] V. Marx. Biology: The big challenges of big data. Nature, 2013.
[13] Vivien Marx. Genomics in the clouds. Nature Methods, 2013.
[14] Nadia Drake, et al. Cloud computing beckons scientists. Nature, 2014.
[15] Rob Patro, et al. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology, 2013.