Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time

Genome assembly asks to reconstruct an unknown string from many shorter substrings of it. Even though it is one of the key problems in Bioinformatics, it has seen few major theoretical advances. Its hardness stems both from practical issues (the size and errors of real data) and from the fact that problem formulations inherently admit multiple solutions. Given these difficulties, most state-of-the-art assemblers are, at their core, based on finding non-branching paths (unitigs) in an assembly graph. If one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions and are thus safe partial solutions. All such safe walks were recently characterized as omnitigs, leading to the first safe and complete genome assembly algorithm. Although omnitig finding was later improved to quadratic time, it remained open whether the crucial linear-time feature of finding unitigs can be attained with omnitigs. We describe a surprising $O(m)$-time algorithm to identify all maximal omnitigs of a graph with $n$ nodes and $m$ arcs, notwithstanding the existence of families of graphs with $\Theta(mn)$ total maximal omnitig size. This is based on the discovery of a family of walks (macrotigs) with the property that all non-trivial omnitigs are univocal extensions of subwalks of a macrotig. This has two consequences: (1) a linear-time output-sensitive algorithm enumerating all maximal omnitigs; (2) a compact $O(m)$ representation of all maximal omnitigs, which allows, e.g., for $O(m)$-time computation of various statistics on them. Our results close a long-standing theoretical question inspired by practical genome assemblers, originating with the use of unitigs in 1995. We envision our results to be at the core of a reverse transfer from theory to practical and complete genome assembly programs, as has been the case for other key Bioinformatics problems.
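To make the baseline notion concrete, the following is a minimal sketch of linear-time unitig finding, that is, of enumerating the maximal non-branching paths of a directed graph. It is not the omnitig or macrotig algorithm described above. The function name `maximal_unitigs` and the representation of the graph as a list of `(tail, head)` arc pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def maximal_unitigs(arcs):
    """Return the maximal non-branching paths (unitigs) of a directed graph.

    `arcs` is an iterable of (tail, head) pairs (an assumed input format).
    A node is "internal" when it has exactly one in-arc and one out-arc;
    a unitig may pass through internal nodes only.
    """
    out_nbrs = defaultdict(list)
    in_deg = defaultdict(int)
    nodes = set()
    for u, v in arcs:
        out_nbrs[u].append(v)
        in_deg[v] += 1
        nodes.update((u, v))

    def is_internal(v):
        return in_deg[v] == 1 and len(out_nbrs[v]) == 1

    unitigs = []
    # Start one path on every arc leaving a branching (non-internal) node
    # and extend it while the last node is internal.
    for u in nodes:
        if is_internal(u):
            continue
        for v in out_nbrs[u]:
            path = [u, v]
            while is_internal(path[-1]):
                path.append(out_nbrs[path[-1]][0])
            unitigs.append(path)

    # Remaining internal nodes lie on isolated cycles; report each cycle once,
    # as a closed path whose first and last node coincide.
    visited = {v for path in unitigs for v in path}
    for u in nodes:
        if u in visited or not is_internal(u):
            continue
        cycle = [u]
        visited.add(u)
        w = out_nbrs[u][0]
        while w != u:
            cycle.append(w)
            visited.add(w)
            w = out_nbrs[w][0]
        cycle.append(u)
        unitigs.append(cycle)
    return unitigs


# Example graph: c branches to d and e, and both lead back to a.
example_arcs = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "a"), ("e", "a")]
example_unitigs = sorted(maximal_unitigs(example_arcs))
```

Each arc is traversed a constant number of times, so the sketch runs in time linear in the number of arcs. In the example graph, every closed arc-covering walk must contain each of the three reported paths, which illustrates why unitigs are safe.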
