An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

Background: Bioinformatics researchers are now confronted with the analysis of ultra large-scale data sets, a problem that will only grow in the coming years. Recent developments in open source software, namely the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters, and for fault-tolerant, parallelized analysis of such data using a programming style named MapReduce.

Description: An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date.

Conclusions: Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to two factors: the cost-effectiveness of Hadoop-based analysis, both on commodity Linux clusters and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and the effectiveness and ease of use of the MapReduce method in parallelizing many data analysis algorithms.
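To make the MapReduce programming style concrete, the following is a minimal single-process sketch in plain Python, not Hadoop code and not taken from the article. It counts k-mers across a set of sequencing reads. The function names (`map_kmers`, `shuffle`, `reduce_counts`, `kmer_count`) and the example reads are illustrative assumptions; on a real cluster the framework would run the map and reduce functions on many nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key.

    In a real MapReduce framework this grouping is done by the system
    between the map and reduce phases; it is simulated here in memory.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values seen for one key into a total."""
    return key, sum(values)


def kmer_count(reads, k=3):
    """Run map, shuffle and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

For example, `kmer_count(["ACGTAC", "CGTA"], k=3)` returns a dictionary mapping each 3-mer to its count across both reads. Because each read is mapped independently and each key is reduced independently, both phases can in principle be distributed across machines, which is the property the abstract refers to.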
