Superbubbles, Ultrabubbles, and Cacti

A superbubble is a type of directed acyclic subgraph with a single source vertex and a single, distinct sink vertex. In genome assembly and genetics, the paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are generalizations of digraphs that are increasingly being used to represent genome assembly and variation problems more fully. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for detecting these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes within its structure the nested decomposition of a graph into snarls and ultrabubbles. We propose, and demonstrate empirically, that this decomposition of bidirected and biedged graphs solves a fundamental problem: it defines genetic sites for any collection of genomic variations, including complex structural variations, without the need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within larger variations, a case not currently handled by existing formats [e.g., variant call format (VCF)].
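To make the idea of "paths as possible sequences" concrete, the following is a minimal sketch in Python, not the detection algorithm described in this study. It assumes a toy directed acyclic bubble whose source and sink are already known, and it only enumerates the source-to-sink paths and spells out the sequence along each one. The vertex names, the sequence labels, and the helper functions `enumerate_paths` and `spell` are hypothetical and chosen purely for illustration; bidirected and biedged graphs are not modeled here.

```python
from typing import Dict, List


def enumerate_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every source-to-sink path in a small DAG by depth-first search.

    Assumes the graph is acyclic; a cycle would cause unbounded recursion.
    """
    if source == sink:
        return [[sink]]
    paths = []
    for successor in graph.get(source, []):
        for tail in enumerate_paths(graph, successor, sink):
            paths.append([source] + tail)
    return paths


def spell(path: List[str], labels: Dict[str, str]) -> str:
    """Concatenate the sequence labels of the vertices along a path."""
    return "".join(labels[vertex] for vertex in path)


# Hypothetical toy bubble: source "s", sink "t", and two alternative
# interior vertices "a" and "b" that differ by a single base.
bubble = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
labels = {"s": "ACG", "a": "T", "b": "C", "t": "GGA"}

paths = enumerate_paths(bubble, "s", "t")
alleles = [spell(path, labels) for path in paths]

for path, allele in zip(paths, alleles):
    print(" -> ".join(path), allele)
```

Each enumerated path corresponds to one candidate sequence at the site, which is the sense in which a bubble can stand for a set of alternative alleles. Exhaustive enumeration is exponential in general, so this is suitable only for small illustrative graphs.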
