Linear-time superbubble identification algorithm for genome assembly

DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilise de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and to filter them out; one such class of complex subgraphs is the so-called superbubble. In this paper, we propose an O(n + m)-time algorithm to detect all superbubbles in a directed acyclic graph with n vertices and m (directed) edges, improving on the best-known O(m log m)-time algorithm by Sung et al.
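To make the detection task concrete, the following is a minimal sketch of a simple per-entrance search for superbubbles. It is not the O(n + m) algorithm proposed in the paper: it restarts a traversal from every candidate entrance and can take quadratic time in the worst case. The function names and the edge-list input format are assumptions made for illustration, and the check follows the usual informal reading of a superbubble: every vertex reached from the entrance s has all of its parents inside the bubble, no inner vertex is a dead end, and the traversal collapses to a single exit t.

```python
from collections import defaultdict


def exit_for(s, children, parents):
    """Return the exit t of a superbubble entered at s, or None."""
    visited, seen = set(), set()
    stack = [s]
    while stack:
        v = stack.pop()
        visited.add(v)
        seen.discard(v)
        if not children[v]:
            return None  # dead end (tip) inside the candidate bubble
        for u in children[v]:
            if u == s:
                return None  # cycle back to the entrance
            seen.add(u)
            # u may be expanded only once all of its parents are inside
            if all(p in visited for p in parents[u]):
                stack.append(u)
        # exactly one frontier vertex left: it is the candidate exit
        if len(stack) == 1 and len(seen) == 1:
            t = stack[0]
            return None if s in children[t] else t
    return None


def find_superbubbles(edges):
    """Return (entrance, exit) pairs for a graph given as (u, v) edges.

    Hypothetical helper: tries every vertex as an entrance, so a single
    edge whose head has no other parent is also reported.
    """
    children, parents = defaultdict(list), defaultdict(list)
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
    vertices = dict.fromkeys(x for e in edges for x in e)
    result = []
    for s in vertices:
        t = exit_for(s, children, parents)
        if t is not None:
            result.append((s, t))
    return result
```

For the diamond-shaped graph with edges a→b, a→c, b→d, c→d, calling `find_superbubbles([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])` returns `[("a", "d")]`. Adding a dead-end edge c→e breaks the bubble, and the result is empty.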
