Estimating Sequence Similarity from Contig Sets

A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.

[1]  Éva Tardos,et al.  Algorithm design , 2005 .

[2]  David Hernández,et al.  De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. , 2008, Genome research.

[3]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[4]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[5]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[6]  René L. Warren,et al.  Assembling millions of short DNA sequences using SSAKE , 2006, Bioinform..

[7]  Dmitry Antipov,et al.  Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads , 2013, RECOMB.

[8]  Petr Rysavý,et al.  Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data , 2016, IDA.

[9]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[10]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[11]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[12]  Enrique Vidal,et al.  Computation of Normalized Edit Distance and Applications , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[14]  Petr Rysavý,et al.  Estimating sequence similarity from read sets for clustering next-generation sequencing data , 2018, Data Mining and Knowledge Discovery.