A unifying framework for seed sensitivity and its application to subset seeds

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately the three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model), each of which is specified by a distinct finite automaton. The approach is then applied to the new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be designed efficiently using our approach and can then be used in similarity search, producing better results than ordinary spaced seeds.
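To make the notion of seed sensitivity concrete, the following is a minimal sketch, not the automaton construction proposed in the paper. It assumes the simplest setting: alignments are binary strings of fixed length (`1` for a match, `0` for a mismatch), each position is a match independently with probability `p`, and the seed is an ordinary spaced seed written with `#` (must match) and `_` (don't care). The function names `seed_hits` and `sensitivity` are hypothetical. The computation tracks the probability of each possible suffix of the alignment among alignments not yet hit, which plays the role of an automaton state.

```python
from itertools import product


def seed_hits(seed: str, window: str) -> bool:
    """True if every '#' position of the seed faces a match ('1') in the window."""
    return all(w == "1" for s, w in zip(seed, window) if s == "#")


def sensitivity(seed: str, length: int, p: float) -> float:
    """Probability that the seed hits a random alignment at least once.

    Assumption: alignment positions are independent, each a match with
    probability p (a Bernoulli model).
    """
    span = len(seed)
    if length < span:
        return 0.0
    k = span - 1
    # State: the last k alignment symbols; value: probability of reaching
    # that state without any seed hit so far.
    states = {}
    for bits in product("01", repeat=k):
        s = "".join(bits)
        ones = s.count("1")
        states[s] = p ** ones * (1 - p) ** (k - ones)
    for _ in range(length - k):
        nxt = dict.fromkeys(states, 0.0)
        for s, prob in states.items():
            for c, pc in (("1", p), ("0", 1 - p)):
                window = s + c
                # Alignments that are hit leave the "not yet hit" mass.
                if not seed_hits(seed, window):
                    nxt[window[1:]] += prob * pc
        states = nxt
    return 1.0 - sum(states.values())


if __name__ == "__main__":
    # Compare a contiguous seed and a spaced seed of the same weight.
    for seed in ("###", "##_#"):
        print(seed, sensitivity(seed, 16, 0.7))
```

The number of states grows as 2^(span - 1), so this sketch is only practical for short seeds; it illustrates the quantity being computed rather than an efficient way to compute it.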
