
Multiple seeds sensitivity using a single seed with threshold

Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained by using many seeds simultaneously; this is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory, and the available RAM is a practical limit on the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we obtain single seeds that, when used with a threshold t > 1, have competitive sensitivity/selectivity trade-offs, indeed close to those of the best multiple seeds known in the literature. In addition, the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage.
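The idea of a single seed used with a threshold can be illustrated with a minimal Python sketch. It is not the authors' construction: the exact rule that maps quadratic residues to don't care positions is not given in the abstract, so the rule in `qr_seed` (a position is required when its index modulo p is 0 or a quadratic residue, and is a don't care otherwise) is an assumption made for illustration. The function names and the `window` parameter, which stands in for "sufficiently close", are likewise hypothetical.

```python
def quadratic_residues(p):
    """Return the set of nonzero quadratic residues modulo the prime p."""
    return {(x * x) % p for x in range(1, p)}


def qr_seed(p, length):
    """Build a seed of the given length as a string of '1' (required
    position) and '0' (don't care).

    ASSUMPTION: position i is required when i mod p is 0 or a quadratic
    residue mod p. The paper's actual rule may differ.
    """
    qr = quadratic_residues(p)
    return "".join(
        "1" if (i % p == 0 or i % p in qr) else "0" for i in range(length)
    )


def seed_hits(seed, a, b):
    """Return the start positions i at which the seed hits, i.e. a and b
    agree on every required position of the seed placed at i.
    The two sequences are compared without gaps."""
    required = [j for j, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b)) - len(seed) + 1
    return [
        i for i in range(max(n, 0))
        if all(a[i + j] == b[i + j] for j in required)
    ]


def passes_threshold(hits, t, window):
    """Return True if at least t hits start within `window` consecutive
    positions. `hits` must be sorted in increasing order and t >= 1.
    ASSUMPTION: 'sufficiently close' is modelled as a fixed-size window."""
    return any(
        hits[k + t - 1] - hits[k] < window
        for k in range(len(hits) - t + 1)
    )


if __name__ == "__main__":
    seed = qr_seed(7, 7)
    a = "ACGTACGTACGT"
    b = "ACGAACGTACGT"  # differs from a at position 3 only
    hits = seed_hits(seed, a, b)
    print(seed)                            # 1110100
    print(hits)                            # [0, 4, 5]
    print(passes_threshold(hits, 2, 3))    # True
    print(passes_threshold(hits, 3, 3))    # False
```

In this toy run, raising t from 2 to 3 with the same window rejects the pair, which is the kind of a posteriori adjustment of sensitivity and selectivity the abstract describes: the hit list is computed once, and only the threshold test changes.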

Lavinia Egidi | Giovanni Manzini
