Understanding MapReduce-based next-generation sequencing alignment on distributed cyberinfrastructure

Although localization of Next-Generation Sequencing (NGS) data is suitable for many analysis and usage scenarios, it is neither universally desirable nor always possible. However, most solutions impose the localization of data as a precondition for NGS analytics. We analyze several existing tools and techniques that use the MapReduce programming model for NGS data analysis to determine their effectiveness and their extensibility to distributed data scenarios, and we find limitations at multiple levels. To overcome these limitations, we developed Pilot-based MapReduce (PMR), a novel implementation of MapReduce built on a Pilot-based task and data management implementation. PMR provides an effective means by which a variety of new or existing methods for NGS and downstream analysis can be carried out, while providing efficiency and scalability across multiple clusters. PMR circumvents the use of a global reduce, yet still provides an effective, scalable and distributed solution for the MapReduce programming model. We compare and contrast the PMR approach with similar capabilities of Seqal and Crossbow, two tools based on conventional Hadoop MapReduce: Seqal for NGS read alignment and duplicate read removal, and Crossbow for NGS read alignment and SNP finding. We find that PMR is a viable tool to support distributed NGS analytics, in particular because it provides a framework that supports parallelism at multiple levels.
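To make the idea of avoiding a global reduce concrete, the following is a minimal sketch in plain Python, not PMR's actual API. It assumes a toy exact-match "aligner" and hypothetical site names, reads, and reference; each site maps and reduces only its own data, so no reads are moved to a single global reducer.

```python
from collections import defaultdict

# Hypothetical toy inputs: a tiny reference and reads held at two separate sites.
REFERENCE = "ACGTACGGTTCA"
SITES = {
    "cluster_a": [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "GGTT")],
    "cluster_b": [("r4", "TTCA"), ("r5", "AAAA")],
}


def map_align(read, reference):
    """Map step: toy exact-match 'alignment' (a stand-in for a real aligner).

    Returns (position, read_id), or None if the read does not match.
    """
    read_id, seq = read
    pos = reference.find(seq)
    return None if pos < 0 else (pos, read_id)


def local_reduce(pairs):
    """Reduce step run inside one site: keep one read per alignment position."""
    by_pos = defaultdict(list)
    for pos, read_id in pairs:
        by_pos[pos].append(read_id)
    # Keep the lexicographically smallest read id as the representative.
    return {pos: min(ids) for pos, ids in by_pos.items()}


def run_site(reads, reference):
    """Run map and reduce entirely on one site's local data."""
    mapped = [m for m in (map_align(r, reference) for r in reads) if m is not None]
    return local_reduce(mapped)


# Each site is processed independently; there is no cross-site shuffle or reduce.
results = {site: run_site(reads, REFERENCE) for site, reads in SITES.items()}
print(results)
```

In this sketch, duplicates are only removed within a site; detecting duplicates that span sites would require an additional merge step, which is the trade-off a design without a global reduce has to weigh.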
