Optimal spaced seeds for faster approximate string matching

Filtering is a standard technique for fast approximate string matching in practice. In filtering, a quick first step rules out almost all positions of a text as possible starting positions for a pattern. Typically this step consists of finding exact matches of small parts of the pattern. In the follow-up step, a slower method verifies or eliminates each remaining position. The running time of such a method depends largely on the quality of the filtering step, as measured by its false positive rate. The quality of such a method depends on the number of true matches that it misses, that is, on its false negative rate.
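
The two-step scheme described above can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the method of the paper. It assumes matching under Hamming distance with at most k mismatches, and a seed written as a string in which '1' marks a position that must match and '*' marks a don't-care position. The function names (sample, filter_candidates, verify, approx_find) are hypothetical.

```python
from collections import defaultdict


def sample(s, start, offsets):
    """Read the characters of s at the seed's '1' positions, shifted by start."""
    return "".join(s[start + o] for o in offsets)


def filter_candidates(text, pattern, seed):
    """Filtering step: keep only start positions where some seed-shaped
    part of the pattern matches the text exactly at the '1' positions."""
    offsets = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)

    # Index every text position by the characters seen through the seed.
    index = defaultdict(list)
    for t in range(len(text) - span + 1):
        index[sample(text, t, offsets)].append(t)

    candidates = set()
    for j in range(len(pattern) - span + 1):
        for t in index.get(sample(pattern, j, offsets), ()):
            start = t - j  # implied start of the whole pattern in the text
            if 0 <= start <= len(text) - len(pattern):
                candidates.add(start)
    return candidates


def verify(text, pattern, start, k):
    """Verification step: accept start if there are at most k mismatches."""
    mismatches = 0
    for a, b in zip(text[start:start + len(pattern)], pattern):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def approx_find(text, pattern, seed, k):
    """Filter first, then verify each surviving candidate position."""
    survivors = filter_candidates(text, pattern, seed)
    return sorted(s for s in survivors if verify(text, pattern, s, k))


if __name__ == "__main__":
    print(approx_find("TTTACGTACGATTT", "ACGTACGA", "11*1", k=1))
```

In this sketch, candidates that survive the filter but fail verification are the false positives that cost running time, and true matches that no seed hit reaches are the false negatives. Whether a given seed misses matches depends on its shape, the pattern length, and k.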
