Paper information - Estimating Sequence Similarity from Contig Sets

Estimating Sequence Similarity from Contig Sets

A key task in computational biology is to determine the mutual similarity of two genomic sequences. Current biotechnologies usually cannot determine the full sequence of a genome from biological material; instead, they produce a set of large substrings (contigs) whose order and relative positions within the genome are unknown. Here we design a function that estimates the sequence similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig sets. Our approach consists of two steps: the first is based on an adaptation of the tractable Smith-Waterman local alignment algorithm, and the second on a reduction to the weighted interval scheduling problem, which can be solved efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline in which only the longest contigs are compared. In high-coverage settings, it also outperforms the estimates produced by the recent method [8], which avoids contig construction completely.
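To illustrate the second step named in the abstract, the following is a minimal sketch of the textbook dynamic program for weighted interval scheduling (see [1]). It is not the authors' implementation: the abstract does not say how intervals and weights are derived from contig alignments, so the `(start, end, weight)` triples and the function name `max_weight_schedule` are assumptions made for illustration only. One plausible reading is that each interval is the span of a local alignment and its weight is the alignment score.

```python
from bisect import bisect_right


def max_weight_schedule(intervals):
    """Pick non-overlapping intervals with maximum total weight.

    `intervals` is a list of (start, end, weight) triples, treated as
    half-open ranges [start, end). This input format is a hypothetical
    choice for the sketch, not something specified by the paper.
    Returns (best_total_weight, chosen_intervals).
    """
    # Sort by end point so that compatible predecessors form a prefix.
    ordered = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ordered]
    n = len(ordered)

    best = [0] * (n + 1)   # best[i]: optimum over the first i intervals
    prev = [0] * n         # prev[i]: how many earlier intervals end <= start of i
    take = [False] * n     # take[i]: whether the optimum for i+1 uses interval i

    for i, (start, _end, weight) in enumerate(ordered):
        # Binary search among the first i intervals for those ending by `start`.
        prev[i] = bisect_right(ends, start, 0, i)
        with_i = best[prev[i]] + weight
        if with_i > best[i]:
            best[i + 1] = with_i
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the selected intervals.
    chosen = []
    i = n
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    # Made-up example data: (start, end, weight).
    demo = [(0, 3, 2), (2, 5, 4), (4, 7, 4), (6, 9, 7), (8, 10, 2)]
    total, picked = max_weight_schedule(demo)
    print(total, picked)
```

Sorting by end point and locating the last compatible interval with binary search gives O(n log n) running time, which is the sense in which the problem can be solved efficiently with dynamic programming.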

Petr Ryšavý | Filip Železný

[1] Éva Tardos, et al. Algorithm design, 2005.

[2] David Hernández, et al. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer, 2008, Genome research.

[3] N. Saitou, et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees, 1987, Molecular biology and evolution.

[4] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[5] Steven J. M. Jones, et al. ABySS: a parallel assembler for short read sequence data, 2009, Genome research.

[6] René L. Warren, et al. Assembling millions of short DNA sequences using SSAKE, 2006, Bioinform.

[7] Dmitry Antipov, et al. Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads, 2013, RECOMB.

[8] Petr Ryšavý, et al. Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data, 2016, IDA.

[9] Leping Li, et al. ART: a next-generation sequencing read simulator, 2012, Bioinform.

[10] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of molecular biology.

[11] E. Birney, et al. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, 2008, Genome research.

[12] Enrique Vidal, et al. Computation of Normalized Edit Distance and Applications, 1993, IEEE Trans. Pattern Anal. Mach. Intell.

[13] C. Mallows, et al. A Method for Comparing Two Hierarchical Clusterings, 1983.

[14] Petr Ryšavý, et al. Estimating sequence similarity from read sets for clustering next-generation sequencing data, 2018, Data Mining and Knowledge Discovery.