Compact Encoding Strategies for DNA Sequence Similarity Search

Determining whether two DNA sequences are similar is an essential component of DNA sequence analysis. Dynamic programming is the algorithm of choice when computational time is not the primary concern. Heuristic search tools such as BLAST are computationally more efficient, but they may miss some sequence similarities (Altschul et al., 1990). These tools typically use k-tuples (words) common to the two sequences as anchor points for the alignment, and spend most of their computational time extending the alignment beyond these anchor points. We describe and provide a DNA sequence similarity search implementation, called SENSEI, that improves on the performance of BLASTN by almost an order of magnitude at comparable sensitivity. The improvement results from four techniques: compactly encoded scoring tables for k-tuples, encoding each base with a single bit, filtering the sequence with XNUN to remove simple sequence repeats, and masking known species-specific repeats in the query sequence. To reduce memory requirements, especially for large genomic DNA query sequences, we recommend generating the neighborhood words from the target sequence at run time rather than by preprocessing the query sequence.
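
As an illustration of how single-bit base encoding and shared k-tuples can yield candidate anchor points, the following is a minimal Python sketch. It is not the SENSEI implementation. The abstract does not say which one-bit mapping is used, so the purine/pyrimidine split (A, G to 0; C, T to 1) is an assumption, and the names `ONE_BIT`, `encode_ktuples`, and `anchor_points` are hypothetical.

```python
# Assumed one-bit mapping: purines (A, G) -> 0, pyrimidines (C, T) -> 1.
# The abstract does not specify the mapping; this choice is illustrative only.
ONE_BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def encode_ktuples(seq, k):
    """Yield (start, word) for every k-tuple of seq, where word is a k-bit integer.

    Windows that contain a character outside A, C, G, T are skipped.
    """
    mask = (1 << k) - 1
    word = 0
    valid = 0  # count of consecutive valid bases ending at the current position
    for i, base in enumerate(seq.upper()):
        bit = ONE_BIT.get(base)
        if bit is None:
            word = 0
            valid = 0
            continue
        # Shift in the new bit and keep only the last k bits.
        word = ((word << 1) | bit) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, word


def anchor_points(query, target, k=8):
    """Return (query_pos, target_pos) pairs whose encoded k-tuples are equal."""
    index = {}
    for qpos, word in encode_ktuples(query, k):
        index.setdefault(word, []).append(qpos)
    anchors = []
    for tpos, word in encode_ktuples(target, k):
        for qpos in index.get(word, ()):
            anchors.append((qpos, tpos))
    return anchors


if __name__ == "__main__":
    print(list(encode_ktuples("ACGT", 2)))
    print(anchor_points("ACGT", "GTAC", k=2))
```

Because one bit per base maps several distinct words to the same code (under this mapping, `AC` and `GT` both encode to `01`), the pairs returned here are only candidates. A real search would still have to score and extend each one against the original sequences.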

[1] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 1970.

[2] S. Karlin, et al. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America, 1990.

[3] M. S. Waterman, et al. Identification of common molecular subsequences. Journal of Molecular Biology, 1981.

[5] Methods: A Companion to Methods in Enzymology, 2022.

[6] M. O. Dayhoff, et al. A Model of Evolutionary Change in Proteins. 1978.

[7] D. Lipman, et al. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988.

[8] S. Altschul. Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 1991.

[9] Jean-Michel Claverie, et al. Information Enhancement Methods for Large Scale Sequence Analysis. Comput. Chem., 1993.

[10] C. Luo, et al. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Molecular Biology and Evolution, 1985.

[11] M. O. Dayhoff. A model of evolutionary change in protein. 1978.

[12] Pankaj Agarwal, et al. The Repeat Pattern Toolkit (RPT): Analyzing the Structure and Evolution of the C. elegans Genome. ISMB, 1994.

[13] E. Myers, et al. Basic local alignment search tool. Journal of Molecular Biology, 1990.

[14] M. O. Dayhoff, et al. Atlas of protein sequence and structure. 1965.

[15] S. Altschul, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. 1991.

[16] D. Lipman, et al. Rapid similarity searches of nucleic acid and protein data banks. Proceedings of the National Academy of Sciences of the United States of America, 1983.