Optimizing Multiple Seeds for Protein Homology Search

We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model the choice of a set of seed models as an integer programming problem and give algorithms for choosing such a set. Although the problem is NP-hard, and quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds chosen this way allows four to five times fewer false-positive hits while preserving essentially the same sensitivity as BLASTP.
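To make the seed-selection idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a vector seed is a pair (v, T) that hits an ungapped alignment wherever the dot product of v with a window of per-position scores reaches T, and it swaps in a greedy weighted set-cover heuristic for the integer program: every training alignment must be hit by some chosen seed, and the summed false-positive rate is kept low. The function names, the `hits` and `fp_rate` inputs, and the greedy rule are all hypothetical choices made for illustration.

```python
from math import inf
from typing import Dict, List, Set


def vector_seed_hits(scores: List[int], v: List[int], threshold: int) -> bool:
    """Return True if the seed (v, threshold) hits some window of the alignment.

    `scores` holds the per-position pairwise scores of an ungapped alignment
    (an assumed representation, not taken from the abstract).
    """
    k = len(v)
    for start in range(len(scores) - k + 1):
        window = scores[start:start + k]
        if sum(w * s for w, s in zip(v, window)) >= threshold:
            return True
    return False


def greedy_seed_set(
    hits: Dict[str, Set[int]],
    fp_rate: Dict[str, float],
    n_alignments: int,
) -> List[str]:
    """Greedy stand-in for the integer program (a heuristic, not an exact solver).

    hits[name]    -- indices of training alignments that seed `name` hits
    fp_rate[name] -- assumed false-positive rate of seed `name`
    Repeatedly picks the seed with the lowest false-positive cost per newly
    covered alignment until every coverable alignment is covered.
    """
    uncovered = set(range(n_alignments))
    chosen: List[str] = []
    while uncovered:
        best_name, best_ratio = None, inf
        for name, covered in hits.items():
            if name in chosen:
                continue
            gain = len(covered & uncovered)
            if gain == 0:
                continue
            ratio = fp_rate[name] / gain
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # remaining alignments cannot be hit by any candidate seed
        chosen.append(best_name)
        uncovered -= hits[best_name]
    return chosen
```

In this sketch, `hits` would be built by running `vector_seed_hits` for every candidate seed over a set of training alignments. An exact approach would instead pass the same covering constraints to an integer programming solver.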
