Spaced Seeds Design Using Perfect Rulers

We consider the problem of lossless spaced seed design for approximate pattern matching. We show that, using mathematical objects known as perfect rulers, we can derive a family of spaced seeds for matching with up to two errors. We analyze these seeds with respect to the trade-off they offer between seed weight and the minimum length of the pattern to be matched. We prove that, for patterns of length up to a few hundred, our seeds have a larger weight, and hence a better filtration efficiency, than those known in the literature. In this context, we study in depth the specific case of Wichmann rulers and prove some preliminary results on generalizing our approach to the larger class of unrestricted rulers.
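The two notions the abstract relies on can be illustrated with a minimal sketch. It does not reproduce the construction of seeds from rulers, which the abstract does not spell out. It assumes the usual conventions: a seed is written as a string where `#` marks a position that must match and `-` a don't-care position, errors are mismatches (Hamming distance), and a seed is lossless for pattern length `m` and `k` errors if, for every placement of `k` mismatches, some shift of the seed avoids all of them. The function names `is_complete_ruler` and `is_lossless` are hypothetical, and the ruler check covers only the "measures every distance" property, not minimality of the number of marks.

```python
from itertools import combinations


def is_complete_ruler(marks, length):
    """Return True if every distance 1..length is a difference of two marks.

    Assumption: this tests only the covering property of a ruler,
    not whether the number of marks is minimal.
    """
    measured = {b - a for a, b in combinations(sorted(marks), 2)}
    return set(range(1, length + 1)) <= measured


def is_lossless(seed, m, k):
    """Brute-force check that `seed` detects every pattern occurrence
    of length m with exactly k mismatches (Hamming model, assumed).

    seed: string over '#' (must match) and '-' (don't care).
    """
    offsets = [i for i, c in enumerate(seed) if c == "#"]
    shifts = range(m - len(seed) + 1)  # empty if the seed is longer than m
    for errors in combinations(range(m), k):
        bad = set(errors)
        # The seed hits at shift s if none of its '#' positions is a mismatch.
        if not any(all(s + o not in bad for o in offsets) for s in shifts):
            return False
    return True


if __name__ == "__main__":
    print(is_complete_ruler([0, 1, 4, 6], 6))  # marks measuring 1..6
    print(is_lossless("##-#", 6, 1))
    print(is_lossless("###", 9, 2))
```

The exhaustive check is exponential in `k` and only practical for small `m`; it is meant as a way to validate a candidate seed, not to design one.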
