Paper information - Halvade: scalable sequence analysis with MapReduce

Halvade: scalable sequence analysis with MapReduce

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under GPL license.

Contact: jan.fostier@intec.ugent.be

Supplementary information: Supplementary data are available at Bioinformatics online.