Multiple seeds sensitivity using a single seed with threshold

Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained by using many seeds simultaneously; this is known as the multiple seed approach. Unfortunately, spaced seeds require a large amount of memory, and the available RAM is a practical limit on the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we obtain single seeds that, when used with a threshold t > 1, have competitive sensitivity/selectivity trade-offs, close to those of the best multiple seeds known in the literature. In addition, the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search for the specific instance at hand. The seeds we propose also exhibit robustness and allow flexibility in usage.
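As a rough illustration of the two ideas above (placing don't care symbols via quadratic residues modulo a prime, and requiring at least t nearby hits), the following Python sketch may help. It is not the construction from the paper: the choice to make position 0 and the residue positions required while treating non-residues as don't care, the function names, and the `window` parameter standing in for "sufficiently close" are all assumptions made for illustration.

```python
def quadratic_residues(p):
    """Return the set of nonzero quadratic residues modulo the prime p."""
    return {(x * x) % p for x in range(1, p)}


def qr_seed(p):
    """Build a length-p seed: '1' = required match, '*' = don't care.

    Assumption (not from the paper): position 0 and the quadratic
    residues are required; non-residue positions are don't care.
    """
    qr = quadratic_residues(p)
    return "".join("1" if (i == 0 or i in qr) else "*" for i in range(p))


def hit_positions(seed, a, b):
    """Start positions where the seed hits the gapless alignment of a and b.

    a and b are assumed to be equal-length strings aligned position by position.
    """
    required = [j for j, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(a) - len(seed) + 1)
        if all(a[i + j] == b[i + j] for j in required)
    ]


def homologous(seed, a, b, t, window):
    """True if at least t hits start within `window` consecutive positions.

    `window` is a hypothetical stand-in for 'sufficiently close'.
    """
    hits = hit_positions(seed, a, b)
    # hits is sorted, so any t hits inside a window include t consecutive ones.
    return any(
        hits[k + t - 1] - hits[k] < window for k in range(len(hits) - t + 1)
    )


seed = qr_seed(7)
a = "ACGTACGTACGT"
b = "ACGAACGTACGT"  # one mismatch at position 3
print(seed, hit_positions(seed, a, b), homologous(seed, a, b, t=2, window=2))
```

Because the threshold is applied only after the hit positions are collected, the same list of hits can be re-evaluated with a different t, which is one way to picture the a posteriori adjustment mentioned in the abstract.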
