Long spaced seeds for finding similarities between biological sequences

Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. A significant fraction of the computing power in the world is devoted to finding similarities between biological sequences. The introduction of optimal spaced seeds in [Ma et al., Bioinformatics 18 (2002) 440–445] has increased both the sensitivity and the speed of homology search, and spaced seeds have been adopted by many alignment programs such as BLAST. Despite a significant amount of work, there are no algorithms able to compute long good seeds. We present a different approach here by introducing a new measure that has two desired properties: (i) it is highly correlated with the sensitivity of spaced seeds and (ii) it is easily computable. Using this measure we give algorithms that compute better seeds than all previous ones. The fact that sensitivity does not need to be computed is essential, as it enables us to compute very long good seeds, far beyond the size for which sensitivity can be computed.
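To make the notions of a spaced seed and its sensitivity concrete, the following is a minimal sketch, not the measure or the algorithms of this paper. It assumes the usual simplified model, which the abstract does not spell out: an alignment is a string of independent match/mismatch positions with match probability p, a seed is a pattern of required-match positions ('1') and don't-care positions ('*'), and sensitivity is the probability that the seed hits somewhere in the alignment. The function names, the Monte Carlo estimate, and the parameter values (length 64, p = 0.7) are illustrative assumptions.

```python
import random


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the seed matches at some offset of the alignment.

    seed: '1' marks a position that must be a match, '*' is a don't-care.
    alignment: '1' marks a match, '0' a mismatch.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(alignment) - len(seed) + 1):
        if all(alignment[start + i] == "1" for i in care):
            return True
    return False


def estimate_sensitivity(seed: str, length: int = 64, p: float = 0.7,
                         trials: int = 20000, rng_seed: int = 0) -> float:
    """Monte Carlo estimate of the hit probability (assumed Bernoulli model)."""
    rng = random.Random(rng_seed)
    hits = 0
    for _ in range(trials):
        alignment = "".join("1" if rng.random() < p else "0"
                            for _ in range(length))
        hits += seed_hits(seed, alignment)
    return hits / trials


if __name__ == "__main__":
    spaced = "111*1**1*1**11*111"   # weight-11 spaced seed, length 18
    contiguous = "1" * 11           # contiguous seed of the same weight
    print("spaced    :", estimate_sensitivity(spaced))
    print("contiguous:", estimate_sensitivity(contiguous))
```

Sampling sidesteps the exact computation, but it only estimates sensitivity and still has to be repeated for every candidate seed, which is why an easily computable measure correlated with sensitivity is attractive when searching for long seeds.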

[1] Bin Ma, et al. PatternHunter: faster and more sensitive homology search, 2002, Bioinform.

[2] Richard M. Karp, et al. Efficient Randomized Pattern-Matching Algorithms, 1987, IBM J. Res. Dev.

[3] Gene H. Golub, et al. Matrix computations, 1983.

[4] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins, 1970, Journal of molecular biology.

[5] Jeremy Buhler, et al. Designing multiple simultaneous seeds for DNA similarity search, 2004, J. Comput. Biol.

[6] Bin Ma, et al. On spaced seeds for similarity search, 2004, Discret. Appl. Math.

[7] Bin Ma, et al. Optimizing Multiple Spaced Seeds for Homology Search, 2004, CPM.

[8] Yann Ponty, et al. Estimating seed sensitivity on homogeneous alignments, 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[9] Franco P. Preparata, et al. Quick, Practical Selection of Effective Seeds for Homology Search, 2005, J. Comput. Biol.

[10] Gregory Kucherov, et al. YASS: enhancing the sensitivity of DNA similarity search, 2005, Nucleic Acids Res.

[11] Pavel A. Pevzner, et al. Multiple filtration and approximate pattern matching, 1995, Algorithmica.

[12] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of molecular biology.

[13] Louxin Zhang, et al. Sensitivity analysis and efficient method for identifying optimal spaced seeds, 2004, J. Comput. Syst. Sci.

[14] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[15] Robert M. Corless, et al. Essential Maple: An Introduction for Scientific Programmers, 1995.

[16] Jeremy Buhler, et al. Designing seeds for similarity search in genomic DNA, 2003, RECOMB '03.

[17] Louxin Zhang, et al. Good spaced seeds for homology search, 2004, Bioinform.

[18] D. Lipman, et al. Rapid and sensitive protein similarity searches, 1985, Science.

[19] Ming Li, et al. Superiority and complexity of the spaced seeds, 2006, SODA 2006.

[20] Gene H. Golub, et al. Matrix computations (3rd ed.), 1996.

[21] Lukas Wagner, et al. A Greedy Algorithm for Aligning DNA Sequences, 2000, J. Comput. Biol.

[22] Bin Ma, et al. tPatternHunter: gapped, fast and sensitive translated homology search, 2005, Bioinform.

[23] Daniel G. Brown, et al. Optimal Spaced Seeds for Homologous Coding Regions, 2004, J. Bioinform. Comput. Biol.

[24] Juha Kärkkäinen, et al. Better Filtering with Gapped q-Grams, 2001, Fundam. Informaticae.

[25] J. Mullikin, et al. SSAHA: a fast search method for large DNA databases, 2001, Genome research.

[26] Bin Ma, et al. PatternHunter II: Highly Sensitive and Fast Homology Search, 2004, J. Bioinform. Comput. Biol.