SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores

Background: There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting sequencing data. Traditional assembly methods, which require long execution times and large amounts of memory on a single workstation, are of limited use on such massive datasets.

Results: This paper presents a highly scalable assembler named SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. A mathematical description of the multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence in merging edges, and a highly scalable computational framework for SWAP is developed to automatically perform the parallel computation of all operations. Graph cleaning and contig extension are also included to generate high-quality contigs. Experimental results show that SWAP-Assembler scales up to 2048 cores on the Yanhuang dataset, completing the assembly in only 26 minutes, which is better than several other parallel assemblers such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler generates high-quality contigs with good N50 size and a low error rate; in particular, it produced the longest N50 contig sizes for the Fish and Yanhuang datasets.

Conclusions: In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. The software is available at: https://sourceforge.net/projects/swapassembler
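
The N50 contig size used above as a quality measure is a standard assembly statistic: the largest length L such that contigs of length at least L together cover at least half of the total assembled bases. The following is a minimal sketch of that definition, not code from SWAP-Assembler; the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Illustrative helper (not part of SWAP-Assembler): walk the contigs from
    longest to shortest and return the length at which the running total
    first reaches half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A larger N50 indicates that more of the assembly sits in long contigs, but it says nothing about correctness, which is why the abstract reports it alongside the error rate.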
