Spaced Seed Design Using Perfect Rulers

A widely used class of approximate pattern matching algorithms works in two stages, the first being a filtering stage that uses spaced seeds to quickly discard regions where a match is unlikely to occur. The design of effective spaced seeds is known to be a hard problem. In this setting, we propose a family of lossless spaced seeds for matching with up to two errors, based on mathematical objects known as perfect rulers. We analyze these seeds with respect to the tradeoff they offer between seed weight and the minimum length of the pattern to be matched. We identify a specific property of rulers, namely their skewness, which is closely related to the minimum pattern length of the derived seeds. In this context, we study in depth the specific case of Wichmann rulers and investigate the generalization of our approach to the larger class of unrestricted rulers. Although our analysis is mainly of theoretical interest, we show that for pattern lengths of practical relevance our seeds have a larger weight, and hence a better filtration efficiency, than the ones known in the literature.
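To make the two central notions concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the construction proposed in the paper. It assumes the usual definitions: a seed, written as a string of `1` (must match) and `0` (don't care), is lossless for pattern length m and k errors if, for every placement of up to k mismatches in a window of length m, at least one alignment of the seed avoids all of them; and a ruler, given as a list of integer marks, is complete if every distance from 1 up to its length is the difference of two marks (a perfect ruler is generally taken to be a complete ruler with the fewest possible marks for its length, which this sketch does not check). All function names are hypothetical.

```python
from itertools import combinations


def seed_hits(seed, length, errors):
    """Return True if some alignment of `seed` inside a window of the
    given length has none of its '1' positions on an error position."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    for shift in range(length - len(seed) + 1):
        if all(shift + i not in errors for i in required):
            return True
    return False


def is_lossless(seed, length, max_errors=2):
    """Brute-force check (assumed definition): every placement of at most
    `max_errors` mismatches still leaves at least one seed hit."""
    for k in range(max_errors + 1):
        for errs in combinations(range(length), k):
            if not seed_hits(seed, length, set(errs)):
                return False
    return True


def min_lossless_length(seed, max_errors=2, limit=64):
    """Smallest window length (up to `limit`) for which the seed is
    lossless, or None if no such length is found."""
    for m in range(len(seed), limit + 1):
        if is_lossless(seed, m, max_errors):
            return m
    return None


def is_complete_ruler(marks):
    """Assumed definition: every distance 1..length is measured by the
    difference of two marks."""
    marks = sorted(marks)
    length = marks[-1] - marks[0]
    measured = {b - a for a, b in combinations(marks, 2)}
    return all(d in measured for d in range(1, length + 1))


if __name__ == "__main__":
    # A contiguous seed of weight 3 tolerates two errors only from length 9 on.
    print(min_lossless_length("111", max_errors=2))
    # The marks 0, 1, 4, 6 measure every distance from 1 to 6.
    print(is_complete_ruler([0, 1, 4, 6]))
```

The brute-force check is exponential in the number of errors and is only meant for small experiments; it shows the tradeoff the abstract refers to, namely that the seed's weight (its number of `1` positions) constrains the minimum pattern length for which it remains lossless.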
