When Cloud Computing Meets Bioinformatics: A Review

Over the past decades, the rapid development of high-throughput technologies has led biological research to generate an unprecedented amount of data. To store and process data at this scale, cloud computing and MapReduce have been applied in many areas of bioinformatics. In this paper, we first introduce the basic concepts of cloud computing and MapReduce and review their applications in bioinformatics. We then highlight some of the problems that challenge the application of cloud computing and MapReduce to bioinformatics. Finally, we give brief guidelines for using cloud computing in biological research.
