Lazer: Distributed memory-efficient assembly of large-scale genomes

Genome sequencing technology has made tremendous progress in both throughput and cost per base pair, resulting in an explosion in the size of sequencing datasets. Consequently, typical sequence assembly tools demand substantial processing power and memory, and cannot assemble large datasets unless run on hundreds of nodes. In this paper, we present a distributed assembler that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By improving memory-to-disk swapping and reducing network communication in the cluster, we can assemble large inputs such as a human genome dataset (452 GB) on just two nodes in 14.5 hours, and in 23 minutes when scaled up to 128 nodes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours.
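To make the idea of a partitioned de Bruijn graph concrete, the following is a minimal sketch of one common way to split such a graph across workers: each k-mer node is assigned to a partition by hashing its canonical form (the lexicographically smaller of the k-mer and its reverse complement). This is an illustration only, not Lazer's actual partitioning scheme, which the abstract does not describe. The function names (`canonical`, `owner`, `partition_graph`), the CRC32 hash, and the in-memory adjacency sets are all assumptions chosen for brevity.

```python
import zlib
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def owner(kmer, num_partitions):
    """Hypothetical ownership rule: hash the canonical k-mer to a partition index."""
    return zlib.crc32(canonical(kmer).encode("ascii")) % num_partitions


def partition_graph(reads, k, num_partitions):
    """Build one adjacency map per partition from (k+1)-mers found in the reads.

    Each (k+1)-mer contributes an edge from its k-length prefix to its
    k-length suffix; the edge is stored in the partition that owns the prefix.
    """
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for read in reads:
        for i in range(len(read) - k):
            src = read[i:i + k]
            dst = read[i + 1:i + k + 1]
            partitions[owner(src, num_partitions)][src].add(dst)
    return partitions


if __name__ == "__main__":
    parts = partition_graph(["ACGTAC", "CGTACG"], k=3, num_partitions=4)
    for index, adjacency in enumerate(parts):
        print(index, dict(adjacency))
```

Hashing the canonical form means a k-mer and its reverse complement always land in the same partition, so each worker can, in principle, process its share of the graph while exchanging only the edges that cross partition boundaries.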
