Superbubbles are a class of induced subgraphs in digraphs that play an essential role in assembly algorithms for high-throughput sequencing data. Each superbubble is connected to the remainder of the host digraph by a single entrance vertex and a single exit vertex. Linear-time algorithms for the enumeration of superbubbles have recently become available. Current approaches require the decomposition of the input digraph into strongly connected components, which are then analyzed separately. In principle, a single depth-first search (DFS) could be used, provided one can guarantee that the root of the DFS-tree is neither located in the interior of a superbubble nor is its exit vertex. Here, we describe a linear-time algorithm to determine suitable roots for a DFS-forest that is guaranteed to identify the superbubbles in a digraph correctly. In addition to the advantage of a more straightforward implementation, we observe a nearly three-fold gain in performance on real-world datasets. We present a reference implementation of the new algorithm that accepts many commonly used input formats for digraphs. It is available as open source on GitHub.
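To make the notion of a single entrance and a single exit concrete, the following is a minimal sketch of a simple per-entrance scan that looks for the exit matching a given candidate entrance. It is not the linear-time DFS-based algorithm described here: applied to every vertex it takes quadratic time in the worst case. The function name `superbubble_exit` and the adjacency-dictionary input format are assumptions made for illustration only.

```python
from collections import defaultdict


def superbubble_exit(succ, s):
    """Return the exit t of the superbubble entered at s, or None.

    succ maps each vertex to an iterable of its out-neighbours
    (hypothetical input format chosen for this sketch).
    """
    # Build the predecessor lists once.
    pred = defaultdict(set)
    for v, outs in succ.items():
        for u in outs:
            pred[u].add(v)

    visited = set()   # vertices whose out-edges have been processed
    seen = set()      # vertices reached but not yet processed
    stack = [s]
    while stack:
        v = stack.pop()
        visited.add(v)
        seen.discard(v)
        children = list(succ.get(v, ()))
        if not children:
            return None            # dead end (tip): no superbubble
        for u in children:
            if u == s:
                return None        # cycle through the entrance
            seen.add(u)
            # u may be processed only after all of its parents
            if all(p in visited for p in pred[u]):
                stack.append(u)
        # Exactly one pending vertex that is also the only one seen:
        # every path from s has converged on it.
        if len(stack) == 1 and len(seen) == 1 and stack[0] in seen:
            t = stack[0]
            if s in succ.get(t, ()):
                return None        # back edge from exit to entrance
            return t
    return None


# A small bubble: two parallel paths from "a" converge at "d".
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(superbubble_exit(g, "a"))  # d
```

Starting the scan at an interior vertex such as `"b"` finds nothing, because `"d"` has a parent that is never visited. This loosely mirrors why the choice of starting point matters for a DFS-based approach as well.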