Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

Background The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.
Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics, where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used to ensure transparency, reproducibility and portability across different operating systems and computing platforms, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in the computational requirements, runtimes and costs associated with different microbial genomics applications. Whereas all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses used up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.
Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) engaged in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.
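
The dollar figures summarized above follow from simple arithmetic on cluster size, runtime and hourly price. The following is a minimal sketch of that arithmetic, not the authors' own calculation: the function names, the per-started-hour billing rule, and every number in the usage lines are hypothetical assumptions chosen for illustration only.

```python
import math


def ec2_cost(instances: int, runtime_hours: float, hourly_rate_usd: float) -> float:
    """Estimate the charge for one analysis run on a cloud cluster.

    Assumption: each instance is billed per started hour, so the
    runtime is rounded up to whole hours before multiplying.
    """
    billed_hours = math.ceil(runtime_hours)
    return instances * billed_hours * hourly_rate_usd


def runs_per_day(runtime_hours: float) -> int:
    """Number of sequential runs of this length that fit into 24 hours."""
    return math.floor(24 / runtime_hours)


# Hypothetical example: 15 instances, a 4.5-hour run, $0.50 per instance-hour.
cost = ec2_cost(instances=15, runtime_hours=4.5, hourly_rate_usd=0.50)
throughput = runs_per_day(4.5)
print(f"estimated cost per run: ${cost:.2f}; runs per day: {throughput}")
```

Rounding the runtime up reflects the assumed billing model; under per-second billing the `math.ceil` step would be dropped. A comparison with a local grid server would additionally need hardware purchase, power and staffing costs, which this sketch does not model.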
