Detecting Superbubbles in Assembly Graphs

We introduce the superbubble, a new class of subgraph for analyzing assembly graphs, and propose an efficient algorithm for detecting superbubbles. Most assembly algorithms work on an assembly graph constructed from reads, such as the de Bruijn graph or the overlap graph. On these graphs, many assemblers first detect simple local structures (motifs), such as tips and bubbles, mainly to find sequencing errors. These motifs are easy to detect, but they are sometimes too simple to capture more complex errors. The superbubble is an extension of the bubble and is likewise important for analyzing assembly graphs. Although superbubbles are much more complex than ordinary bubbles, we show that they can be enumerated efficiently. For a graph with n vertices and m edges, our algorithm runs in linear time, O(n + m), in the average case under a reasonable graph model, though its worst-case time complexity is quadratic, O(n(n + m)). Moreover, the algorithm is very fast in practice: our experiments show that it runs in reasonable time on a single CPU core even on a very large graph built from a whole human genome.
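The abstract does not spell out the definition or the detection procedure, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It assumes the commonly used reading of a superbubble as a pair (s, t) in a directed graph where every path leaving s rejoins at t, the enclosed subgraph is acyclic and has no tips, and t is the nearest such vertex. The function names (`build_adjacency`, `find_superbubble_exit`, `enumerate_superbubbles`) and the edge-list input format are hypothetical.

```python
def build_adjacency(edges):
    """Build successor/predecessor lists from (u, v) pairs (assumed input format)."""
    succ, pred = {}, {}
    for u, v in dict.fromkeys(edges):  # drop duplicate edges, keep first-seen order
        for w in (u, v):
            succ.setdefault(w, [])
            pred.setdefault(w, [])
        succ[u].append(v)
        pred[v].append(u)
    return succ, pred


def find_superbubble_exit(succ, pred, s):
    """Return the exit t of a superbubble entered at s, or None if there is none.

    Vertices are expanded only after all of their parents have been expanded,
    so a cycle or a tip inside the candidate region makes the search give up.
    """
    seen = set()     # reached from s but not yet expanded
    pending = {}     # vertex -> number of parents not yet expanded
    stack = [s]
    while stack:
        v = stack.pop()
        seen.discard(v)
        if not succ[v]:
            return None                  # dead end (tip) inside the region
        for u in succ[v]:
            if u == s:
                return None              # cycle back to the entrance
            seen.add(u)
            pending[u] = pending.get(u, len(pred[u])) - 1
            if pending[u] == 0:          # every parent of u has been expanded
                stack.append(u)
        if len(stack) == 1 and seen == {stack[0]}:
            t = stack[0]                 # all paths from s have merged at t
            return None if s in succ[t] else t
    return None


def enumerate_superbubbles(edges):
    """Try every vertex as an entrance; each attempt is one bounded traversal."""
    succ, pred = build_adjacency(edges)
    result = []
    for s in succ:
        t = find_superbubble_exit(succ, pred, s)
        if t is not None:
            result.append((s, t))
    return result


if __name__ == "__main__":
    diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"), ("t", "x")]
    print(enumerate_superbubbles(diamond))
```

On the diamond-shaped example this reports ("s", "t") for the bubble itself and, under this procedure, also the lone edge ("t", "x"). Each call to `find_superbubble_exit` touches every vertex and edge at most once, so trying all n entrances costs O(n(n + m)) in the worst case, which is consistent with the quadratic bound quoted above. The average-case linear behavior depends on the paper's graph model and is not demonstrated by this sketch.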
