Fast de Bruijn Graph Compaction in Distributed Memory Environments

De Bruijn graph-based genome assembly has gained popularity as short-read sequencers have become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm are that it bounds the chain compaction run-time to a number of iterations logarithmic in the length of the longest chain, and that it differentiates cycles from chains within a number of iterations logarithmic in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole-genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is $3.7\times$ and $2.0\times$ faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno.
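The logarithmic iteration bound can be illustrated with pointer doubling, the classic technique behind parallel list ranking. The following is a minimal sequential sketch, not the implementation from the paper: it assumes a simplified directed graph in which every node has at most one successor and at most one predecessor (the paper works on bi-directed graphs in distributed memory), and the function name `compact_chains` and its input format are hypothetical. Each round, every unresolved node replaces its pointer with its pointer's pointer, so the distance it spans doubles. The stopping rule used here (stop after a round in which no node newly reaches a chain end, and report the remaining nodes as cycle members) is a simplification chosen for this sketch and may differ from the cycle criterion the paper uses.

```python
from collections import defaultdict


def compact_chains(succ):
    """Pointer-doubling sketch on a simplified directed chain graph.

    succ maps each node to its successor, or to None at a chain end.
    Assumes every node has at most one predecessor (valid chains/cycles).
    Returns (chains, cycles, rounds):
      chains: {end_node: [nodes ordered from chain start to end_node]}
      cycles: set of nodes that never reach a chain end
      rounds: number of doubling rounds executed
    """
    # A chain end points to itself at distance 0; others span 1 hop.
    nxt = {v: (v if s is None else s) for v, s in succ.items()}
    dist = {v: (0 if s is None else 1) for v, s in succ.items()}
    ends = {v for v, s in succ.items() if s is None}
    resolved = {v for v in nxt if nxt[v] in ends}

    rounds = 0
    while True:
        # Synchronous round: read old pointers, write new ones.
        new_nxt, new_dist = dict(nxt), dict(dist)
        for v in nxt:
            if v not in resolved:
                w = nxt[v]
                new_nxt[v] = nxt[w]               # jump twice as far
                new_dist[v] = dist[v] + dist[w]   # accumulate hop count
        nxt, dist = new_nxt, new_dist
        rounds += 1

        newly = {v for v in nxt if v not in resolved and nxt[v] in ends}
        if not newly:
            break  # no chain made progress: the rest are cycle nodes
        resolved |= newly

    # Group resolved nodes by chain end; larger distance means earlier.
    groups = defaultdict(list)
    for v in resolved:
        groups[nxt[v]].append(v)
    chains = {end: sorted(nodes, key=lambda u: -dist[u])
              for end, nodes in groups.items()}
    cycles = set(nxt) - resolved
    return chains, cycles, rounds


if __name__ == "__main__":
    # One 5-node chain (0 -> 1 -> 2 -> 3 -> 4) and one 3-node cycle.
    graph = {0: 1, 1: 2, 2: 3, 3: 4, 4: None, 10: 11, 11: 12, 12: 10}
    chains, cycles, rounds = compact_chains(graph)
    print(chains, cycles, rounds)
```

In this sketch, every node of a chain ends up knowing its chain end and its distance to it, which is enough to order the nodes of each unitig. In a distributed setting each pointer lookup would become a remote query to the process that owns the target node; that communication layer is omitted here.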
